# HG changeset patch # User bgruening # Date 1381174472 14400 # Node ID 9790cfb46d03c694a19f080a9bda6a9d5ed420c0 Uploaded diff -r 000000000000 -r 9790cfb46d03 COPYING --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/COPYING Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,674 @@ + GNU GENERAL PUBLIC LICENSE + Version 3, 29 June 2007 + + Copyright (C) 2007 Free Software Foundation, Inc. + Everyone is permitted to copy and distribute verbatim copies + of this license document, but changing it is not allowed. + + Preamble + + The GNU General Public License is a free, copyleft license for +software and other kinds of works. + + The licenses for most software and other practical works are designed +to take away your freedom to share and change the works. By contrast, +the GNU General Public License is intended to guarantee your freedom to +share and change all versions of a program--to make sure it remains free +software for all its users. We, the Free Software Foundation, use the +GNU General Public License for most of our software; it applies also to +any other work released this way by its authors. You can apply it to +your programs, too. + + When we speak of free software, we are referring to freedom, not +price. Our General Public Licenses are designed to make sure that you +have the freedom to distribute copies of free software (and charge for +them if you wish), that you receive source code or can get it if you +want it, that you can change the software or use pieces of it in new +free programs, and that you know you can do these things. + + To protect your rights, we need to prevent others from denying you +these rights or asking you to surrender the rights. Therefore, you have +certain responsibilities if you distribute copies of the software, or if +you modify it: responsibilities to respect the freedom of others. 
+ + For example, if you distribute copies of such a program, whether +gratis or for a fee, you must pass on to the recipients the same +freedoms that you received. You must make sure that they, too, receive +or can get the source code. And you must show them these terms so they +know their rights. + + Developers that use the GNU GPL protect your rights with two steps: +(1) assert copyright on the software, and (2) offer you this License +giving you legal permission to copy, distribute and/or modify it. + + For the developers' and authors' protection, the GPL clearly explains +that there is no warranty for this free software. For both users' and +authors' sake, the GPL requires that modified versions be marked as +changed, so that their problems will not be attributed erroneously to +authors of previous versions. + + Some devices are designed to deny users access to install or run +modified versions of the software inside them, although the manufacturer +can do so. This is fundamentally incompatible with the aim of +protecting users' freedom to change the software. The systematic +pattern of such abuse occurs in the area of products for individuals to +use, which is precisely where it is most unacceptable. Therefore, we +have designed this version of the GPL to prohibit the practice for those +products. If such problems arise substantially in other domains, we +stand ready to extend this provision to those domains in future versions +of the GPL, as needed to protect the freedom of users. + + Finally, every program is threatened constantly by software patents. +States should not allow patents to restrict development and use of +software on general-purpose computers, but in those that do, we wish to +avoid the special danger that patents applied to a free program could +make it effectively proprietary. To prevent this, the GPL assures that +patents cannot be used to render the program non-free. 
+ + The precise terms and conditions for copying, distribution and +modification follow. + + TERMS AND CONDITIONS + + 0. Definitions. + + "This License" refers to version 3 of the GNU General Public License. + + "Copyright" also means copyright-like laws that apply to other kinds of +works, such as semiconductor masks. + + "The Program" refers to any copyrightable work licensed under this +License. Each licensee is addressed as "you". "Licensees" and +"recipients" may be individuals or organizations. + + To "modify" a work means to copy from or adapt all or part of the work +in a fashion requiring copyright permission, other than the making of an +exact copy. The resulting work is called a "modified version" of the +earlier work or a work "based on" the earlier work. + + A "covered work" means either the unmodified Program or a work based +on the Program. + + To "propagate" a work means to do anything with it that, without +permission, would make you directly or secondarily liable for +infringement under applicable copyright law, except executing it on a +computer or modifying a private copy. Propagation includes copying, +distribution (with or without modification), making available to the +public, and in some countries other activities as well. + + To "convey" a work means any kind of propagation that enables other +parties to make or receive copies. Mere interaction with a user through +a computer network, with no transfer of a copy, is not conveying. + + An interactive user interface displays "Appropriate Legal Notices" +to the extent that it includes a convenient and prominently visible +feature that (1) displays an appropriate copyright notice, and (2) +tells the user that there is no warranty for the work (except to the +extent that warranties are provided), that licensees may convey the +work under this License, and how to view a copy of this License. 
If +the interface presents a list of user commands or options, such as a +menu, a prominent item in the list meets this criterion. + + 1. Source Code. + + The "source code" for a work means the preferred form of the work +for making modifications to it. "Object code" means any non-source +form of a work. + + A "Standard Interface" means an interface that either is an official +standard defined by a recognized standards body, or, in the case of +interfaces specified for a particular programming language, one that +is widely used among developers working in that language. + + The "System Libraries" of an executable work include anything, other +than the work as a whole, that (a) is included in the normal form of +packaging a Major Component, but which is not part of that Major +Component, and (b) serves only to enable use of the work with that +Major Component, or to implement a Standard Interface for which an +implementation is available to the public in source code form. A +"Major Component", in this context, means a major essential component +(kernel, window system, and so on) of the specific operating system +(if any) on which the executable work runs, or a compiler used to +produce the work, or an object code interpreter used to run it. + + The "Corresponding Source" for a work in object code form means all +the source code needed to generate, install, and (for an executable +work) run the object code and to modify the work, including scripts to +control those activities. However, it does not include the work's +System Libraries, or general-purpose tools or generally available free +programs which are used unmodified in performing those activities but +which are not part of the work. 
For example, Corresponding Source +includes interface definition files associated with source files for +the work, and the source code for shared libraries and dynamically +linked subprograms that the work is specifically designed to require, +such as by intimate data communication or control flow between those +subprograms and other parts of the work. + + The Corresponding Source need not include anything that users +can regenerate automatically from other parts of the Corresponding +Source. + + The Corresponding Source for a work in source code form is that +same work. + + 2. Basic Permissions. + + All rights granted under this License are granted for the term of +copyright on the Program, and are irrevocable provided the stated +conditions are met. This License explicitly affirms your unlimited +permission to run the unmodified Program. The output from running a +covered work is covered by this License only if the output, given its +content, constitutes a covered work. This License acknowledges your +rights of fair use or other equivalent, as provided by copyright law. + + You may make, run and propagate covered works that you do not +convey, without conditions so long as your license otherwise remains +in force. You may convey covered works to others for the sole purpose +of having them make modifications exclusively for you, or provide you +with facilities for running those works, provided that you comply with +the terms of this License in conveying all material for which you do +not control copyright. Those thus making or running the covered works +for you must do so exclusively on your behalf, under your direction +and control, on terms that prohibit them from making any copies of +your copyrighted material outside their relationship with you. + + Conveying under any other circumstances is permitted solely under +the conditions stated below. Sublicensing is not allowed; section 10 +makes it unnecessary. + + 3. 
Protecting Users' Legal Rights From Anti-Circumvention Law. + + No covered work shall be deemed part of an effective technological +measure under any applicable law fulfilling obligations under article +11 of the WIPO copyright treaty adopted on 20 December 1996, or +similar laws prohibiting or restricting circumvention of such +measures. + + When you convey a covered work, you waive any legal power to forbid +circumvention of technological measures to the extent such circumvention +is effected by exercising rights under this License with respect to +the covered work, and you disclaim any intention to limit operation or +modification of the work as a means of enforcing, against the work's +users, your or third parties' legal rights to forbid circumvention of +technological measures. + + 4. Conveying Verbatim Copies. + + You may convey verbatim copies of the Program's source code as you +receive it, in any medium, provided that you conspicuously and +appropriately publish on each copy an appropriate copyright notice; +keep intact all notices stating that this License and any +non-permissive terms added in accord with section 7 apply to the code; +keep intact all notices of the absence of any warranty; and give all +recipients a copy of this License along with the Program. + + You may charge any price or no price for each copy that you convey, +and you may offer support or warranty protection for a fee. + + 5. Conveying Modified Source Versions. + + You may convey a work based on the Program, or the modifications to +produce it from the Program, in the form of source code under the +terms of section 4, provided that you also meet all of these conditions: + + a) The work must carry prominent notices stating that you modified + it, and giving a relevant date. + + b) The work must carry prominent notices stating that it is + released under this License and any conditions added under section + 7. 
This requirement modifies the requirement in section 4 to + "keep intact all notices". + + c) You must license the entire work, as a whole, under this + License to anyone who comes into possession of a copy. This + License will therefore apply, along with any applicable section 7 + additional terms, to the whole of the work, and all its parts, + regardless of how they are packaged. This License gives no + permission to license the work in any other way, but it does not + invalidate such permission if you have separately received it. + + d) If the work has interactive user interfaces, each must display + Appropriate Legal Notices; however, if the Program has interactive + interfaces that do not display Appropriate Legal Notices, your + work need not make them do so. + + A compilation of a covered work with other separate and independent +works, which are not by their nature extensions of the covered work, +and which are not combined with it such as to form a larger program, +in or on a volume of a storage or distribution medium, is called an +"aggregate" if the compilation and its resulting copyright are not +used to limit the access or legal rights of the compilation's users +beyond what the individual works permit. Inclusion of a covered work +in an aggregate does not cause this License to apply to the other +parts of the aggregate. + + 6. Conveying Non-Source Forms. + + You may convey a covered work in object code form under the terms +of sections 4 and 5, provided that you also convey the +machine-readable Corresponding Source under the terms of this License, +in one of these ways: + + a) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by the + Corresponding Source fixed on a durable physical medium + customarily used for software interchange. 
+ + b) Convey the object code in, or embodied in, a physical product + (including a physical distribution medium), accompanied by a + written offer, valid for at least three years and valid for as + long as you offer spare parts or customer support for that product + model, to give anyone who possesses the object code either (1) a + copy of the Corresponding Source for all the software in the + product that is covered by this License, on a durable physical + medium customarily used for software interchange, for a price no + more than your reasonable cost of physically performing this + conveying of source, or (2) access to copy the + Corresponding Source from a network server at no charge. + + c) Convey individual copies of the object code with a copy of the + written offer to provide the Corresponding Source. This + alternative is allowed only occasionally and noncommercially, and + only if you received the object code with such an offer, in accord + with subsection 6b. + + d) Convey the object code by offering access from a designated + place (gratis or for a charge), and offer equivalent access to the + Corresponding Source in the same way through the same place at no + further charge. You need not require recipients to copy the + Corresponding Source along with the object code. If the place to + copy the object code is a network server, the Corresponding Source + may be on a different server (operated by you or a third party) + that supports equivalent copying facilities, provided you maintain + clear directions next to the object code saying where to find the + Corresponding Source. Regardless of what server hosts the + Corresponding Source, you remain obligated to ensure that it is + available for as long as needed to satisfy these requirements. 
+ + e) Convey the object code using peer-to-peer transmission, provided + you inform other peers where the object code and Corresponding + Source of the work are being offered to the general public at no + charge under subsection 6d. + + A separable portion of the object code, whose source code is excluded +from the Corresponding Source as a System Library, need not be +included in conveying the object code work. + + A "User Product" is either (1) a "consumer product", which means any +tangible personal property which is normally used for personal, family, +or household purposes, or (2) anything designed or sold for incorporation +into a dwelling. In determining whether a product is a consumer product, +doubtful cases shall be resolved in favor of coverage. For a particular +product received by a particular user, "normally used" refers to a +typical or common use of that class of product, regardless of the status +of the particular user or of the way in which the particular user +actually uses, or expects or is expected to use, the product. A product +is a consumer product regardless of whether the product has substantial +commercial, industrial or non-consumer uses, unless such uses represent +the only significant mode of use of the product. + + "Installation Information" for a User Product means any methods, +procedures, authorization keys, or other information required to install +and execute modified versions of a covered work in that User Product from +a modified version of its Corresponding Source. The information must +suffice to ensure that the continued functioning of the modified object +code is in no case prevented or interfered with solely because +modification has been made. 
+ + If you convey an object code work under this section in, or with, or +specifically for use in, a User Product, and the conveying occurs as +part of a transaction in which the right of possession and use of the +User Product is transferred to the recipient in perpetuity or for a +fixed term (regardless of how the transaction is characterized), the +Corresponding Source conveyed under this section must be accompanied +by the Installation Information. But this requirement does not apply +if neither you nor any third party retains the ability to install +modified object code on the User Product (for example, the work has +been installed in ROM). + + The requirement to provide Installation Information does not include a +requirement to continue to provide support service, warranty, or updates +for a work that has been modified or installed by the recipient, or for +the User Product in which it has been modified or installed. Access to a +network may be denied when the modification itself materially and +adversely affects the operation of the network or violates the rules and +protocols for communication across the network. + + Corresponding Source conveyed, and Installation Information provided, +in accord with this section must be in a format that is publicly +documented (and with an implementation available to the public in +source code form), and must require no special password or key for +unpacking, reading or copying. + + 7. Additional Terms. + + "Additional permissions" are terms that supplement the terms of this +License by making exceptions from one or more of its conditions. +Additional permissions that are applicable to the entire Program shall +be treated as though they were included in this License, to the extent +that they are valid under applicable law. 
If additional permissions +apply only to part of the Program, that part may be used separately +under those permissions, but the entire Program remains governed by +this License without regard to the additional permissions. + + When you convey a copy of a covered work, you may at your option +remove any additional permissions from that copy, or from any part of +it. (Additional permissions may be written to require their own +removal in certain cases when you modify the work.) You may place +additional permissions on material, added by you to a covered work, +for which you have or can give appropriate copyright permission. + + Notwithstanding any other provision of this License, for material you +add to a covered work, you may (if authorized by the copyright holders of +that material) supplement the terms of this License with terms: + + a) Disclaiming warranty or limiting liability differently from the + terms of sections 15 and 16 of this License; or + + b) Requiring preservation of specified reasonable legal notices or + author attributions in that material or in the Appropriate Legal + Notices displayed by works containing it; or + + c) Prohibiting misrepresentation of the origin of that material, or + requiring that modified versions of such material be marked in + reasonable ways as different from the original version; or + + d) Limiting the use for publicity purposes of names of licensors or + authors of the material; or + + e) Declining to grant rights under trademark law for use of some + trade names, trademarks, or service marks; or + + f) Requiring indemnification of licensors and authors of that + material by anyone who conveys the material (or modified versions of + it) with contractual assumptions of liability to the recipient, for + any liability that these contractual assumptions directly impose on + those licensors and authors. + + All other non-permissive additional terms are considered "further +restrictions" within the meaning of section 10. 
If the Program as you +received it, or any part of it, contains a notice stating that it is +governed by this License along with a term that is a further +restriction, you may remove that term. If a license document contains +a further restriction but permits relicensing or conveying under this +License, you may add to a covered work material governed by the terms +of that license document, provided that the further restriction does +not survive such relicensing or conveying. + + If you add terms to a covered work in accord with this section, you +must place, in the relevant source files, a statement of the +additional terms that apply to those files, or a notice indicating +where to find the applicable terms. + + Additional terms, permissive or non-permissive, may be stated in the +form of a separately written license, or stated as exceptions; +the above requirements apply either way. + + 8. Termination. + + You may not propagate or modify a covered work except as expressly +provided under this License. Any attempt otherwise to propagate or +modify it is void, and will automatically terminate your rights under +this License (including any patent licenses granted under the third +paragraph of section 11). + + However, if you cease all violation of this License, then your +license from a particular copyright holder is reinstated (a) +provisionally, unless and until the copyright holder explicitly and +finally terminates your license, and (b) permanently, if the copyright +holder fails to notify you of the violation by some reasonable means +prior to 60 days after the cessation. + + Moreover, your license from a particular copyright holder is +reinstated permanently if the copyright holder notifies you of the +violation by some reasonable means, this is the first time you have +received notice of violation of this License (for any work) from that +copyright holder, and you cure the violation prior to 30 days after +your receipt of the notice. 
+ + Termination of your rights under this section does not terminate the +licenses of parties who have received copies or rights from you under +this License. If your rights have been terminated and not permanently +reinstated, you do not qualify to receive new licenses for the same +material under section 10. + + 9. Acceptance Not Required for Having Copies. + + You are not required to accept this License in order to receive or +run a copy of the Program. Ancillary propagation of a covered work +occurring solely as a consequence of using peer-to-peer transmission +to receive a copy likewise does not require acceptance. However, +nothing other than this License grants you permission to propagate or +modify any covered work. These actions infringe copyright if you do +not accept this License. Therefore, by modifying or propagating a +covered work, you indicate your acceptance of this License to do so. + + 10. Automatic Licensing of Downstream Recipients. + + Each time you convey a covered work, the recipient automatically +receives a license from the original licensors, to run, modify and +propagate that work, subject to this License. You are not responsible +for enforcing compliance by third parties with this License. + + An "entity transaction" is a transaction transferring control of an +organization, or substantially all assets of one, or subdividing an +organization, or merging organizations. If propagation of a covered +work results from an entity transaction, each party to that +transaction who receives a copy of the work also receives whatever +licenses to the work the party's predecessor in interest had or could +give under the previous paragraph, plus a right to possession of the +Corresponding Source of the work from the predecessor in interest, if +the predecessor has it or can get it with reasonable efforts. + + You may not impose any further restrictions on the exercise of the +rights granted or affirmed under this License. 
For example, you may +not impose a license fee, royalty, or other charge for exercise of +rights granted under this License, and you may not initiate litigation +(including a cross-claim or counterclaim in a lawsuit) alleging that +any patent claim is infringed by making, using, selling, offering for +sale, or importing the Program or any portion of it. + + 11. Patents. + + A "contributor" is a copyright holder who authorizes use under this +License of the Program or a work on which the Program is based. The +work thus licensed is called the contributor's "contributor version". + + A contributor's "essential patent claims" are all patent claims +owned or controlled by the contributor, whether already acquired or +hereafter acquired, that would be infringed by some manner, permitted +by this License, of making, using, or selling its contributor version, +but do not include claims that would be infringed only as a +consequence of further modification of the contributor version. For +purposes of this definition, "control" includes the right to grant +patent sublicenses in a manner consistent with the requirements of +this License. + + Each contributor grants you a non-exclusive, worldwide, royalty-free +patent license under the contributor's essential patent claims, to +make, use, sell, offer for sale, import and otherwise run, modify and +propagate the contents of its contributor version. + + In the following three paragraphs, a "patent license" is any express +agreement or commitment, however denominated, not to enforce a patent +(such as an express permission to practice a patent or covenant not to +sue for patent infringement). To "grant" such a patent license to a +party means to make such an agreement or commitment not to enforce a +patent against the party. 
+ + If you convey a covered work, knowingly relying on a patent license, +and the Corresponding Source of the work is not available for anyone +to copy, free of charge and under the terms of this License, through a +publicly available network server or other readily accessible means, +then you must either (1) cause the Corresponding Source to be so +available, or (2) arrange to deprive yourself of the benefit of the +patent license for this particular work, or (3) arrange, in a manner +consistent with the requirements of this License, to extend the patent +license to downstream recipients. "Knowingly relying" means you have +actual knowledge that, but for the patent license, your conveying the +covered work in a country, or your recipient's use of the covered work +in a country, would infringe one or more identifiable patents in that +country that you have reason to believe are valid. + + If, pursuant to or in connection with a single transaction or +arrangement, you convey, or propagate by procuring conveyance of, a +covered work, and grant a patent license to some of the parties +receiving the covered work authorizing them to use, propagate, modify +or convey a specific copy of the covered work, then the patent license +you grant is automatically extended to all recipients of the covered +work and works based on it. + + A patent license is "discriminatory" if it does not include within +the scope of its coverage, prohibits the exercise of, or is +conditioned on the non-exercise of one or more of the rights that are +specifically granted under this License. 
You may not convey a covered +work if you are a party to an arrangement with a third party that is +in the business of distributing software, under which you make payment +to the third party based on the extent of your activity of conveying +the work, and under which the third party grants, to any of the +parties who would receive the covered work from you, a discriminatory +patent license (a) in connection with copies of the covered work +conveyed by you (or copies made from those copies), or (b) primarily +for and in connection with specific products or compilations that +contain the covered work, unless you entered into that arrangement, +or that patent license was granted, prior to 28 March 2007. + + Nothing in this License shall be construed as excluding or limiting +any implied license or other defenses to infringement that may +otherwise be available to you under applicable patent law. + + 12. No Surrender of Others' Freedom. + + If conditions are imposed on you (whether by court order, agreement or +otherwise) that contradict the conditions of this License, they do not +excuse you from the conditions of this License. If you cannot convey a +covered work so as to satisfy simultaneously your obligations under this +License and any other pertinent obligations, then as a consequence you may +not convey it at all. For example, if you agree to terms that obligate you +to collect a royalty for further conveying from those to whom you convey +the Program, the only way you could satisfy both those terms and this +License would be to refrain entirely from conveying the Program. + + 13. Use with the GNU Affero General Public License. + + Notwithstanding any other provision of this License, you have +permission to link or combine any covered work with a work licensed +under version 3 of the GNU Affero General Public License into a single +combined work, and to convey the resulting work. 
The terms of this +License will continue to apply to the part which is the covered work, +but the special requirements of the GNU Affero General Public License, +section 13, concerning interaction through a network will apply to the +combination as such. + + 14. Revised Versions of this License. + + The Free Software Foundation may publish revised and/or new versions of +the GNU General Public License from time to time. Such new versions will +be similar in spirit to the present version, but may differ in detail to +address new problems or concerns. + + Each version is given a distinguishing version number. If the +Program specifies that a certain numbered version of the GNU General +Public License "or any later version" applies to it, you have the +option of following the terms and conditions either of that numbered +version or of any later version published by the Free Software +Foundation. If the Program does not specify a version number of the +GNU General Public License, you may choose any version ever published +by the Free Software Foundation. + + If the Program specifies that a proxy can decide which future +versions of the GNU General Public License can be used, that proxy's +public statement of acceptance of a version permanently authorizes you +to choose that version for the Program. + + Later license versions may give you additional or different +permissions. However, no additional obligations are imposed on any +author or copyright holder as a result of your choosing to follow a +later version. + + 15. Disclaimer of Warranty. + + THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY +APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT +HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY +OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, +THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR +PURPOSE. 
THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM +IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF +ALL NECESSARY SERVICING, REPAIR OR CORRECTION. + + 16. Limitation of Liability. + + IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING +WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS +THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY +GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE +USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF +DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD +PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), +EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF +SUCH DAMAGES. + + 17. Interpretation of Sections 15 and 16. + + If the disclaimer of warranty and limitation of liability provided +above cannot be given local legal effect according to their terms, +reviewing courts shall apply local law that most closely approximates +an absolute waiver of all civil liability in connection with the +Program, unless a warranty or assumption of liability accompanies a +copy of the Program in return for a fee. + + END OF TERMS AND CONDITIONS + + How to Apply These Terms to Your New Programs + + If you develop a new program, and you want it to be of the greatest +possible use to the public, the best way to achieve this is to make it +free software which everyone can redistribute and change under these terms. + + To do so, attach the following notices to the program. It is safest +to attach them to the start of each source file to most effectively +state the exclusion of warranty; and each file should have at least +the "copyright" line and a pointer to where the full notice is found. 
+ + <one line to give the program's name and a brief idea of what it does.> + Copyright (C) <year> <name of author> + + This program is free software: you can redistribute it and/or modify + it under the terms of the GNU General Public License as published by + the Free Software Foundation, either version 3 of the License, or + (at your option) any later version. + + This program is distributed in the hope that it will be useful, + but WITHOUT ANY WARRANTY; without even the implied warranty of + MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + GNU General Public License for more details. + + You should have received a copy of the GNU General Public License + along with this program. If not, see <http://www.gnu.org/licenses/>. + +Also add information on how to contact you by electronic and paper mail. + + If the program does terminal interaction, make it output a short +notice like this when it starts in an interactive mode: + + <program> Copyright (C) <year> <name of author> + This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. + This is free software, and you are welcome to redistribute it + under certain conditions; type `show c' for details. + +The hypothetical commands `show w' and `show c' should show the appropriate +parts of the General Public License. Of course, your program's commands +might be different; for a GUI interface, you would use an "about box". + + You should also get your employer (if you work as a programmer) or school, +if any, to sign a "copyright disclaimer" for the program, if necessary. +For more information on this, and how to apply and follow the GNU GPL, see +<http://www.gnu.org/licenses/>. + + The GNU General Public License does not permit incorporating your program +into proprietary programs. If your program is a subroutine library, you +may consider it more useful to permit linking proprietary applications with +the library. If this is what you want to do, use the GNU Lesser General +Public License instead of this License. But first, please read +<http://www.gnu.org/philosophy/why-not-lgpl.html>.
diff -r 000000000000 -r 9790cfb46d03 create_index.py --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/create_index.py Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,15 @@ +#!/usr/bin/env python + +import os +import sys + +o = open( sys.argv[1], 'w+' ) + + +o.write('

InterProScan result summary page

    ' ) + +for filename in [f for f in os.listdir( sys.argv[2] ) if os.path.isfile( os.path.join( sys.argv[2], f) )]: + o.write( '
  • %s
  • ' % ( filename, os.path.splitext( filename )[0] ) ) + +o.write( '
' ) +o.close() diff -r 000000000000 -r 9790cfb46d03 prinseq-graphs-noPCA.pl --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/prinseq-graphs-noPCA.pl Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,2655 @@ +#!/usr/bin/perl + +#=============================================================================== +# Author: Robert SCHMIEDER, Computational Science Research Center @ SDSU, CA +# +# File: prinseq-graphs +# Date: 2012-12-22 +# Version: 0.6 graphs +# +# Usage: +# prinseq-graphs [options] +# +# Try 'prinseq-graphs-noPCA -h' for more information. +# +# Purpose: PRINSEQ will help you to preprocess your genomic or metagenomic +# sequence data in FASTA or FASTQ format. The graphs version allows +# users of the lite version to generate graphs similar to the web +# version. +# +# Bugs: Please use http://sourceforge.net/tracker/?group_id=315449 +# +#=============================================================================== + +use strict; +use warnings; + +use Getopt::Long; +use Pod::Usage; +use File::Temp qw(tempfile); #for output files +use Fcntl qw(:flock SEEK_END); #for log file +use Cwd; +use JSON; +use Cairo; +#use Statistics::PCA; +use MIME::Base64; +use File::Basename; +use Data::Dumper; ### + +$| = 1; # Do not buffer output + +my $PI = 4 * atan2(1, 1); +my $LOG62 = log(62); +my $DINUCODDS_VIR = [ + [qw(1.086940308 0.98976932 1.034167044 0.880024041 1.070421277 0.990687084 0.890945575 1.069957074 0.92465631 0.803973303)], + [qw(1.101064857 0.986812783 1.038299155 0.896162618 1.081652847 0.976365237 0.867445186 1.06727283 0.94688543 0.768007295)], + [qw(1.071548411 0.912204166 1.196914981 0.80628184 1.294201511 1.148517794 0.269295791 1.033948026 0.895951033 0.623192149)], + [qw(1.090253719 0.907428629 1.203991784 0.786359294 1.281499107 1.145421568 0.235974709 1.033437274 0.899580091 0.631699771)], + [qw(1.075864745 1.003413074 1.01872902 0.897841689 0.980373171 1.05854979 0.934262259 1.052477953 0.88145851 0.889239724)], + [qw(1.101890467 1.030028291 1.019912674 
0.84191395 1.0015174 1.069546264 0.900151602 0.996269395 0.889195343 0.904039022)], + [qw(1.152417359 0.855028574 0.91164793 1.017415486 1.114163672 1.128353311 0.846355573 0.916745489 1.206820475 0.811014651)], + [qw(1.142454218 0.8635465 0.923406967 1.026242747 1.134445058 1.131747833 0.79793368 0.920767641 1.179468556 0.799770057)], + [qw(1.124462747 0.873556143 0.945627041 1.013755408 1.159866153 1.096259526 0.757315047 0.972924919 1.105562567 0.772731886)], + [qw(1.143826972 0.866968779 0.995740249 0.945859278 1.109590621 1.089305083 0.76048874 0.971561388 1.157101408 0.792923027)], + [qw(1.131900141 0.82776996 0.996204924 0.999433455 1.024692372 1.071176333 0.921026216 1.088936699 1.054010776 0.773498892)], + [qw(1.042180476 0.930180412 1.019242897 0.98909997 1.006666828 1.046708539 0.959492164 1.011183418 1.055168776 0.937433818)], + [qw(1.086515695 0.985345815 0.930914307 0.969581792 1.043010232 1.087463712 0.939482285 0.990551965 0.954752469 0.893972874)], + [qw(1.096657826 0.950117614 0.936195529 0.965619788 1.114975275 1.077011195 0.843153131 0.989128406 1.043790912 0.840634731)], + [qw(1.158030995 0.935307365 0.874812261 1.056236525 1.117171274 0.937484692 1.057442372 0.970079538 1.174848738 0.725071711)], + [qw(1.15591506 0.93000227 0.883538923 1.0567652 1.095730954 0.944489906 1.074229471 0.983993745 1.156051409 0.726688465)], + [qw(1.205726473 0.924439339 1.049457756 0.805718412 0.975472778 1.07581991 0.726992211 1.075025787 0.8704929 0.726672843)], + [qw(1.188544681 0.95239611 1.049066985 0.790031334 1.038632598 1.056749787 0.665197397 1.057566244 0.862429061 0.708982398)], + [qw(1.063631482 0.925593715 1.014869316 0.944904401 1.119690731 1.325971834 0.273781451 0.943347677 1.06438014 0.920825904)], + [qw(1.077560287 0.911888545 1.044147857 0.927758054 1.058535939 1.296838544 0.421514996 0.945722451 1.128317986 0.926419928)], + [qw(1.163753415 0.989905668 0.893599328 0.955641844 1.176047687 0.941559156 0.950641089 0.959741692 1.100815282 
0.72491925)], + [qw(1.139253929 0.946297517 0.922096125 1.024801537 1.205206793 0.968818717 0.915801342 0.971626058 1.107569276 0.627623404)] + ]; +my $DINUCODDS_MIC = [ + [qw(1.13127323 0.853587195 0.911041047 1.104520778 1.065586428 1.021434164 0.999734139 1.063684014 1.078035184 0.733596552)], + [qw(1.173267344 0.840539337 0.919534602 1.068050141 1.062394214 1.051999071 0.96770576 1.035511729 1.095600433 0.72328141)], + [qw(1.172939786 0.84567902 0.911836259 1.106288994 1.05351787 1.026143368 1.002308358 1.066319771 1.094918797 0.710733535)], + [qw(1.073527689 0.850290918 0.978455025 1.080882178 1.111174765 1.010754115 0.895668707 1.072980666 1.079304608 0.754057386)], + [qw(1.08807747 0.837444678 0.95824965 1.097310298 1.118897971 1.030863881 0.886827263 1.072349394 1.07406322 0.733440096)], + [qw(1.071685485 0.861055813 0.966566865 1.090268118 1.112945761 1.012538936 0.909535491 1.063745603 1.071156598 0.755770377)], + [qw(1.142698587 0.867936867 1.000612099 0.977934257 1.111801746 1.018318601 0.788556794 0.987763594 1.184649653 0.784776176)], + [qw(1.134560074 0.876651844 0.998190253 0.995723123 1.128448077 1.014172324 0.781776188 0.971020602 1.182411449 0.786449476)], + [qw(1.180029632 0.787899325 1.01316945 0.932268406 1.077837263 1.211699678 0.612128817 1.033036699 1.157314398 0.74940288)], + [qw(1.160925546 0.788308899 1.003702496 0.965371236 1.076051693 1.188304271 0.641536444 1.070331188 1.124067192 0.740126813)], + [qw(1.173873006 0.790118011 1.014718833 0.937979878 1.07453725 1.207167373 0.622279064 1.046150047 1.145627707 0.742212886)], + [qw(1.128383111 0.870541389 0.987269741 0.98353238 1.115643879 1.040107028 0.774505865 1.010896432 1.164757274 0.775254395)], + [qw(1.15297511 0.853883985 0.956393231 1.000027661 1.139915472 1.01355294 0.838843622 1.015553125 1.216219741 0.70447264)], + [qw(1.148264236 0.852123859 0.974568293 0.985455546 1.13192373 1.015879393 0.828987111 1.016820786 1.216647853 0.71634006)], + [qw(1.12933788 0.831777975 1.005434367 
0.991081409 1.126146895 1.07421504 0.69343913 1.054032466 1.14809591 0.728541157)], + [qw(1.124157235 0.828112691 1.022348424 0.983822386 1.143028487 1.081830005 0.672594435 1.05685982 1.149537403 0.684432106)], + [qw(1.128029586 0.841853305 1.00983936 0.967179139 1.122524003 1.094555807 0.659238308 1.061578854 1.1243601 0.740148171)], + [qw(1.093521636 0.855071052 0.929160818 1.203773691 1.178257185 0.881341255 1.078305505 1.051988532 1.169143967 0.555057308)], + [qw(1.073737278 0.877396537 0.968017446 1.124155374 1.166244435 0.909044208 0.999147578 1.071098934 1.120156138 0.607444953)], + [qw(1.092150184 0.863407008 0.927040387 1.185387013 1.171670826 0.882276859 1.083058605 1.048379554 1.168635365 0.580337997)] + ]; +my $DATA_VIR = [ + [2,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [3,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [42,2,'Human (nasal)',[127/255, 127/255, 255/255,1]], + [43,2,'Human (nasal)',[127/255, 127/255, 255/255,1]], + [45,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [49,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [52,3,'Human (sputum)',[127/255, 127/255, 255/255,1]], + [54,3,'Human (sputum)',[127/255, 127/255, 255/255,1]], + [55,4,'Human (sputum, CF)',[127/255, 127/255, 255/255,1]], + [57,4,'Human (sputum, CF)',[127/255, 127/255, 255/255,1]], + [88,5,'Freshwater (Hot spring)',[127/255, 127/255, 255/255,1]], + [89,5,'Freshwater (Hot spring)',[127/255, 127/255, 255/255,1]], + [98,6,'Freshwater (Antartic lake)',[127/255, 127/255, 255/255,1]], + [99,6,'Freshwater (Antartic lake)',[127/255, 127/255, 255/255,1]], + [100,7,'Freshwater (reclaimed)',[127/255, 127/255, 255/255,1]], + [102,7,'Freshwater (reclaimed)',[127/255, 127/255, 255/255,1]], + [153,8,'Mouse (brain tissue)',[127/255, 127/255, 255/255,1]], + [154,8,'Mouse (brain tissue)',[127/255, 127/255, 255/255,1]], + [202,9,'Fish (gut)',[127/255, 127/255, 255/255,1]], + [206,9,'Fish (gut)',[127/255, 127/255, 255/255,1]], + [209,10,'Mosquito',[127/255, 127/255, 
255/255,1]], + [211,10,'Mosquito',[127/255, 127/255, 255/255,1]], + ['U',0,'User input',[255/255, 127/255, 127/255,1]] + ]; +my $DATA_MIC = [ + [17,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [20,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [22,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [63,2,'Mouse (fecal)',[127/255, 127/255, 255/255,1]], + [65,2,'Mouse (fecal)',[127/255, 127/255, 255/255,1]], + [68,2,'Mouse (fecal)',[127/255, 127/255, 255/255,1]], + [93,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [95,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [109,4,'Marine (open ocean)',[127/255, 127/255, 255/255,1]], + [110,4,'Marine (open ocean)',[127/255, 127/255, 255/255,1]], + [111,4,'Marine (open ocean)',[127/255, 127/255, 255/255,1]], + [120,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [124,5,'Marine (estuary)',[127/255, 127/255, 255/255,1]], + [125,5,'Marine (estuary)',[127/255, 127/255, 255/255,1]], + [134,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [146,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [148,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [201,6,'Fish (gut)',[127/255, 127/255, 255/255,1]], + [203,7,'Fish (slime)',[127/255, 127/255, 255/255,1]], + [205,6,'Fish (gut)',[127/255, 127/255, 255/255,1]], + ['U',0,'User input',[255/255, 127/255, 127/255,1]] + ]; +my $BASE64_BASES = {A => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAAzVJREFUeNrsnMFxo0AQRWe7fJcyMBnYGawyMIe9a0JQJtbefDPOAB33 +JmdgZyBlsIpgl9lCLkwJA/N7uhu0XTXlkstI8Oh+agbG355+/XDC8VaNu8htf1ZjI73DJPx59wCg +EN4phDQkNAsWGqCkIeUM7zFrSL7OBDS+VyObMyQrZWsSUlZnACfw5dwgcZ/5BZPfTEHyEwCvColL +2O24q/uuWUDKJ1TGKpCCsB8Sn4Dl1CGlbvxEBD51SCIlR4lL4VYAUnKB08SzSCSbUkFKLWxRgdMM +sii5wK1BOlksuRSQVoCwA9wjIPDVVCAhWVTWw1SZc0MK8lxHblvUP7fA569TCJyMZFET0qEa75ay +iRtSrDwDlLfG663CPohAQoRdtF4jXrrlFjgZKbU2lN/VeLFSclyQlkAzt6s95BiziVXgXJByFz/7 
+WH7x+6OFbOKCFCvL0wUffeUqFYFzQELu7/eVFAKJTeCkmEVDIARXvWqXHAoJEXbwzZ4BZJ/AM21I +iLCLESV50swmMlxqzZ6pnCqkDBD2a0dvlErguRYkiSw6x16zZyKlDy4FwDbjARE4AYBihf1Se0YS +EnRSaSJZpNozxUAKaRv7QNYR/KZSEXgMpI1CFjUhifdMMZBypUzgAB0lcIoAFDv72J6ijY0tuL1P +DckrZ5GrQSM90yYlpMxh9/cfq/GHaSBPq4xeVUBCWWQt/kMaEKNWFQyFJPVAlmRsuCF5N7/wnJCW +TvaBLKkYLHC60iwadWzEWbtzFXgfpNUMhT06CeiKS23wMVKPsNdXAKlX4HTlWTToWG8SQdoxXK3H +zA7E3r0JAr/vmqXogoSu3w87vFeA9AwK3I8pN+Rr/6gAKAQ669m5qoA6hJ0r7mxsoE/Hda4qoA6i +CzDttaJI0TMRc6mFKdqDIqS9w2YtLy4LowTC1o4tdzYR83VaaQASu8Dpwh/ERuzta+441H0am8Cp +1TwuJp5FSQROTB32yRgk9Om4TwI/Q8oc9g9XCmcv2LKJmIRtERL6LfexqoAYSo3r9nUKgb+D7+HP +kFBhW8wi1p6JHL4KujQMCRX4v1UFARJyu2infBky5KIXPYn+rwADAOL8qKxS08x7AAAAAElFTkSu +QmCC', + C => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAA7BJREFUeNrsnM1xqzAQxxUNDfBKwCWQCt7g+7vgEkgJ5pRDTnYJpgRz +eXeYVBBKCCU8SvAzM6sZxuMPaXclQaydYYKTGPBv/7tagdYvp9NJTO3Px6dwZPl5S2A/hdf3rD9v +1eT1nvuC/r7/vvr7SLizDGAUEzgmNr5nN3mt9ksAWNu6cNuQYoCyhX0bpmANoK4K9tlMWrrw0euH +8/YPPkTsQKkxnIv9nNKSZ79BQb5sy3kNkjnnfMMFzsFiUHNDVZVk9FyDTMguBowvGDS8QTpejDpz +tARAZT4gNRr1zZyswYCSrk84Azuahp58MkAqoR9NkjkG0m7BgG5V76yQcgtD/B6mFqvz9nJlW8Pf +uacdha6zI0P6B6YLbGH6UGv+b3tRbnCNpgdwDpuSOEr9cU61AXXUBOX9YlJWolOVS4MwyxnUs2L6 +cAr2G1MhzAKJKu8K1DMw55UKYFHVlFMhYe//KKuZPH7v+CXxGCyQsNZbBjTNUzURUoyFlFEmhhAK +g3BjVDUVWEg5MV90DgvEy3vgppZi66ScGAKurTJMDxXAvXuPPMLGqUYy7T1A6mBLHxSlRg6MMPLT +hOTLWnBuNVELKS9GD5I2ttDzCalkSOJaiTsmKKkVP8wks4qE4xHNKyRKhd0HSCHcyCPb4LDC9g4p +DqFmL9yGZ4EUkrbhBDeYBSWJoKQAKViAFCAFSLOERKl1kqCkoKSgJFMl9QGSPUijpQHSE6rppypJ +tU5Y7Qig3IL1vZ5ydNJ403BcdzSuZBt71Rp4ncxJSbFHSNmN36melxMAK6iQhgWrSWf9wu6KylBL +byiQCo+hliIcqlTmFFLmaZSjOKfCQFIrNLDmuqUrIULqsHO3muhVl+UAxSl3F3lIDQlSHhMZ9XAQ +w9tKqOlAUs2/lBA4OAgz6jlIkDjUlFsEpTqOqGsXeiokqppUfmqYQy+BY1Lz3sPPJg0O1DPkDXSL +5xV1fjEAanVKHZM7kxtG72ObCjN4L9eAoLUQ36SVqwNFcdQ/GWzTUL6V+7aTn5zhqh0dpl/DUYLE +ueZm6lshhHDbEd4Lg8WnmAcBG7H8dZFGqQMDSfWa9QsG1NmGpOS6XiAoVC+vJMb164JCr8TWe9SH 
+kwOAqmcO6I1SEEvGON/MEI5KC5QWL9bH3KOaVjNSVQXXQ15XLi14TrW0+1r03kIKYGtrlRYvdM0h +dUPlvMI5WQeTyIFXW/Cqeu5VMIPpheUuTZdfobifjDTTXvxYcz5YXsBxtrD+vwADADoA0kx0ZQr1 +AAAAAElFTkSuQmCC', + G => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAA6lJREFUeNrsnN2RqjAUgANjA9wS3CefsQQsAUtgS8AStAQsQUuQEuR5 +nzYlSAkumTnZy/UKJpyTEANn5syqs0L8cn4hIbjf70zI19eaWZS40aT1Pm80Uvhe1ei59b6Ez8hl +tbr+vl5YhpLCa8xx4h54RqCZhCQsI9OwEkYEr2700OgRXqMlNARn3+gN/kbMrrTPXzS6dA2SHFzO +3BBhyd8wrtEhJTAYV+A8Sg7ji8eCJGbpQmHWhkWM7wrJwxqkCODk7L3kpDvmBWJW3sF6+qyfQRY0 +YknvDqgNKjUByRdAUgqVYK4DKQJ/9gWQ/E0FJSQl6gNExIVdo9tGgw5dw/8cDJw/fhXIA8UGN8cW +ZA9ybPVaQ4vEjHDSapgI/qzBDRXjEBUgAeaj0U8EIAl5Dcepidwux7hbQTRTG3ApTmyRa6LOP+3q +M0OFLybIk1fwQ0pmRjhMQEVgTdkQSHsCQBti6+mzVE5gTVqQMmS6l4BqZkckKGymi3UhYQa8tQio +7Xo7gisaSpASZHrdWXCxvrqLI61JqcFNkW52HLmSPmrG0yOA5ezfGw2dxaSI8t9s+GXXjcFMppOp +bj21WgWhoHMyX90tSRCAuAOAZEws4XecdS6LPJOFik9qmq0rsqE6UEic1VyCxExBWiJcrRoh5Y8C +CeNqJfNUKCFVU4GEaUP4DGm2JDQkb63oEVKEyGz1lCCxGZJaMemKiKL2PpJeuiDNme0NLck7SNFU +INUzJLOQ2AzptSxnSLO7kaTyyGdQVJC8drmQsJOPpwJpDt4KkDCXYBPisYmbCgFSuSl3qxHuFk3B +krDWlE0FEiZ4p1OBdEZmuHgKkDjSmrIpQMJaU2Yg0zkJCXtPfz8FSDUSVOwTqL4rk9gtCvnI2Y6s +6e6DRLEg6zRSfBLnvNqAJOST4BwXyxZVMOLtZq8gcUazMOtkIUaJrHozUYKo3C2hWm6cgwtQu5/c +qV2Y6h1VINUMv4C8nfUuoBnyOALOHSzU6GWaQOOBLntmZue2XDLMe4rYpHWVwcbu8XK1uv4uTNXZ +zb1j/z+thkJS1xtj3Tu4W+bxYq22JWEgyZ1APoPaPhbSQ9YCSFC+rbYVE//xLC4OXTAhQR08ASTi +7bqr1AkJDr59YziiUP7zarIplt6cu8zUcTjKu8Gp1idxsCjXg/qB/d1yrzxO6pVuJcyQS6VCBWEh +GNpiBYYfoSiLz/0IYM6gg/rO9qbAwOJzJmVrgd0l3pdEGFXGbUP6EWAA2LwDwtC8jpAAAAAASUVO +RK5CYII=', + T => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAALRJREFUeNrs09ENQDAUQFHEXlhAYgJWMJnEBLqBUWxQFkCC/si5yftq +mzYnaR5jzM4KXXu++J9CNc311YYi022QIEGCBAkSJEiCBCll5c16k+DO4Zj+4dnxmPXj92xvkZYE +SPWLs2uiN/lukCBBggQJkiBBggQJEiRIggQJEiRIkCBBEiRIkCBBggRJkCBBggQJEiRICCBBggQJ +EiRIggQJEiRIkCAJEiRIkCBBggRJ1+0CDAAzsw5U48snWgAAAABJRU5ErkJggg==', + N => 
'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAACXBIWXMAAAsTAAALEwEAmpwYAAAK +T2lDQ1BQaG90b3Nob3AgSUNDIHByb2ZpbGUAAHjanVNnVFPpFj333vRCS4iAlEtvUhUIIFJCi4AU +kSYqIQkQSoghodkVUcERRUUEG8igiAOOjoCMFVEsDIoK2AfkIaKOg6OIisr74Xuja9a89+bN/rXX +Pues852zzwfACAyWSDNRNYAMqUIeEeCDx8TG4eQuQIEKJHAAEAizZCFz/SMBAPh+PDwrIsAHvgAB +eNMLCADATZvAMByH/w/qQplcAYCEAcB0kThLCIAUAEB6jkKmAEBGAYCdmCZTAKAEAGDLY2LjAFAt +AGAnf+bTAICd+Jl7AQBblCEVAaCRACATZYhEAGg7AKzPVopFAFgwABRmS8Q5ANgtADBJV2ZIALC3 +AMDOEAuyAAgMADBRiIUpAAR7AGDIIyN4AISZABRG8lc88SuuEOcqAAB4mbI8uSQ5RYFbCC1xB1dX +Lh4ozkkXKxQ2YQJhmkAuwnmZGTKBNA/g88wAAKCRFRHgg/P9eM4Ors7ONo62Dl8t6r8G/yJiYuP+ +5c+rcEAAAOF0ftH+LC+zGoA7BoBt/qIl7gRoXgugdfeLZrIPQLUAoOnaV/Nw+H48PEWhkLnZ2eXk +5NhKxEJbYcpXff5nwl/AV/1s+X48/Pf14L7iJIEyXYFHBPjgwsz0TKUcz5IJhGLc5o9H/LcL//wd +0yLESWK5WCoU41EScY5EmozzMqUiiUKSKcUl0v9k4t8s+wM+3zUAsGo+AXuRLahdYwP2SycQWHTA +4vcAAPK7b8HUKAgDgGiD4c93/+8//UegJQCAZkmScQAAXkQkLlTKsz/HCAAARKCBKrBBG/TBGCzA +BhzBBdzBC/xgNoRCJMTCQhBCCmSAHHJgKayCQiiGzbAdKmAv1EAdNMBRaIaTcA4uwlW4Dj1wD/ph +CJ7BKLyBCQRByAgTYSHaiAFiilgjjggXmYX4IcFIBBKLJCDJiBRRIkuRNUgxUopUIFVIHfI9cgI5 +h1xGupE7yAAygvyGvEcxlIGyUT3UDLVDuag3GoRGogvQZHQxmo8WoJvQcrQaPYw2oefQq2gP2o8+ +Q8cwwOgYBzPEbDAuxsNCsTgsCZNjy7EirAyrxhqwVqwDu4n1Y8+xdwQSgUXACTYEd0IgYR5BSFhM +WE7YSKggHCQ0EdoJNwkDhFHCJyKTqEu0JroR+cQYYjIxh1hILCPWEo8TLxB7iEPENyQSiUMyJ7mQ +AkmxpFTSEtJG0m5SI+ksqZs0SBojk8naZGuyBzmULCAryIXkneTD5DPkG+Qh8lsKnWJAcaT4U+Io +UspqShnlEOU05QZlmDJBVaOaUt2ooVQRNY9aQq2htlKvUYeoEzR1mjnNgxZJS6WtopXTGmgXaPdp +r+h0uhHdlR5Ol9BX0svpR+iX6AP0dwwNhhWDx4hnKBmbGAcYZxl3GK+YTKYZ04sZx1QwNzHrmOeZ +D5lvVVgqtip8FZHKCpVKlSaVGyovVKmqpqreqgtV81XLVI+pXlN9rkZVM1PjqQnUlqtVqp1Q61Mb +U2epO6iHqmeob1Q/pH5Z/YkGWcNMw09DpFGgsV/jvMYgC2MZs3gsIWsNq4Z1gTXEJrHN2Xx2KruY +/R27iz2qqaE5QzNKM1ezUvOUZj8H45hx+Jx0TgnnKKeX836K3hTvKeIpG6Y0TLkxZVxrqpaXllir +SKtRq0frvTau7aedpr1Fu1n7gQ5Bx0onXCdHZ4/OBZ3nU9lT3acKpxZNPTr1ri6qa6UbobtEd79u +p+6Ynr5egJ5Mb6feeb3n+hx9L/1U/W36p/VHDFgGswwkBtsMzhg8xTVxbzwdL8fb8VFDXcNAQ6Vh 
+lWGX4YSRudE8o9VGjUYPjGnGXOMk423GbcajJgYmISZLTepN7ppSTbmmKaY7TDtMx83MzaLN1pk1 +mz0x1zLnm+eb15vft2BaeFostqi2uGVJsuRaplnutrxuhVo5WaVYVVpds0atna0l1rutu6cRp7lO +k06rntZnw7Dxtsm2qbcZsOXYBtuutm22fWFnYhdnt8Wuw+6TvZN9un2N/T0HDYfZDqsdWh1+c7Ry +FDpWOt6azpzuP33F9JbpL2dYzxDP2DPjthPLKcRpnVOb00dnF2e5c4PziIuJS4LLLpc+Lpsbxt3I +veRKdPVxXeF60vWdm7Obwu2o26/uNu5p7ofcn8w0nymeWTNz0MPIQ+BR5dE/C5+VMGvfrH5PQ0+B +Z7XnIy9jL5FXrdewt6V3qvdh7xc+9j5yn+M+4zw33jLeWV/MN8C3yLfLT8Nvnl+F30N/I/9k/3r/ +0QCngCUBZwOJgUGBWwL7+Hp8Ib+OPzrbZfay2e1BjKC5QRVBj4KtguXBrSFoyOyQrSH355jOkc5p +DoVQfujW0Adh5mGLw34MJ4WHhVeGP45wiFga0TGXNXfR3ENz30T6RJZE3ptnMU85ry1KNSo+qi5q +PNo3ujS6P8YuZlnM1VidWElsSxw5LiquNm5svt/87fOH4p3iC+N7F5gvyF1weaHOwvSFpxapLhIs +OpZATIhOOJTwQRAqqBaMJfITdyWOCnnCHcJnIi/RNtGI2ENcKh5O8kgqTXqS7JG8NXkkxTOlLOW5 +hCepkLxMDUzdmzqeFpp2IG0yPTq9MYOSkZBxQqohTZO2Z+pn5mZ2y6xlhbL+xW6Lty8elQfJa7OQ +rAVZLQq2QqboVFoo1yoHsmdlV2a/zYnKOZarnivN7cyzytuQN5zvn//tEsIS4ZK2pYZLVy0dWOa9 +rGo5sjxxedsK4xUFK4ZWBqw8uIq2Km3VT6vtV5eufr0mek1rgV7ByoLBtQFr6wtVCuWFfevc1+1d +T1gvWd+1YfqGnRs+FYmKrhTbF5cVf9go3HjlG4dvyr+Z3JS0qavEuWTPZtJm6ebeLZ5bDpaql+aX +Dm4N2dq0Dd9WtO319kXbL5fNKNu7g7ZDuaO/PLi8ZafJzs07P1SkVPRU+lQ27tLdtWHX+G7R7ht7 +vPY07NXbW7z3/T7JvttVAVVN1WbVZftJ+7P3P66Jqun4lvttXa1ObXHtxwPSA/0HIw6217nU1R3S +PVRSj9Yr60cOxx++/p3vdy0NNg1VjZzG4iNwRHnk6fcJ3/ceDTradox7rOEH0x92HWcdL2pCmvKa +RptTmvtbYlu6T8w+0dbq3nr8R9sfD5w0PFl5SvNUyWna6YLTk2fyz4ydlZ19fi753GDborZ752PO +32oPb++6EHTh0kX/i+c7vDvOXPK4dPKy2+UTV7hXmq86X23qdOo8/pPTT8e7nLuarrlca7nuer21 +e2b36RueN87d9L158Rb/1tWeOT3dvfN6b/fF9/XfFt1+cif9zsu72Xcn7q28T7xf9EDtQdlD3YfV +P1v+3Njv3H9qwHeg89HcR/cGhYPP/pH1jw9DBY+Zj8uGDYbrnjg+OTniP3L96fynQ89kzyaeF/6i +/suuFxYvfvjV69fO0ZjRoZfyl5O/bXyl/erA6xmv28bCxh6+yXgzMV70VvvtwXfcdx3vo98PT+R8 +IH8o/2j5sfVT0Kf7kxmTk/8EA5jz/GMzLdsAAAAgY0hSTQAAeiUAAICDAAD5/wAAgOkAAHUwAADq +YAAAOpgAABdvkl/FRgAAAh1JREFUeNrsmk1xwzAQRr8RgYRBwqBhkDJoGbQMagZ1GbgMVAYNA5dB +wsBm4CBwL9Wx0Uwk7593Z3z0SHmRn3fXi3me8d8FoAUw33kdQB/9PXu9xWCeZ4QFN9zBSCwJ6Qig 
+cUj5aAFsHdLt2Fh47ALBGi8AHh2ScYlTQXrQLPFAuJZaiVNC2gCIDikfTxolHhjWjA4pH7s/Pzmk +TDQA9g7JUCYeGNdWI/HAvH50SEYkHgTs4V26xIOQfUSHlI8jgGeHlI9OagEsCdIOQtspQdh+REo8 +CPzjokNSKPGlIJ0qnKatdUgdgJ/CArhdw+NW+qZ6A888ASmkM4DPCifSvLhbANdCib9ahzRV+JHs +mThFCvCtXeJUeVLpaWKVOBWkAcCH1kycMuPuAIwF97PNE1BCqiHxlkPi1LVbX1iysHyK4ihwm8Lc +iXwojAPSUOE0dNYhJbdctEics5/UVAC9tQ6pB/BVKPFoHVINiZPME3BDmirUZdE6pPSmKimAF58n +kPIhoKlw/946pDPKupiLZuKSPim1FSR+sA6pRgG8sQ4JKO9iYg2QAAGNfw2QBpR3Mc1DSrnT6JCW +l7h5SKkAPjmk5QvgVUAaIGAeQDqklImPDkl47qQFUo+yLuYqILFKXBOkCUzTJZogpUz84pAESlwj +pDPKZzHNQ0q509Uh5SXeOKR8RBB1MTVDIpO4dkgDCLqY2iGl3Gl0SMwS/x0AsYSfWCRqIfIAAAAA +SUVORK5CYII=' + }; +my $MMCHART_B2 = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAAGCAYAAAACEPQxAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAABdJREFUeNpiYGBg+M/w//9/BmwEQIABANxBD/HRDNRSAAAAAElFTkSu +QmCC'; +my $FREQCHART_L = 'iVBORw0KGgoAAAANSUhEUgAAAC8AAABvCAIAAADzHQ6XAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAABWNJREFUeNrkm91V7DoMhTOsaQBKgBKgBCjhUgKUAI/wBiUwJUAJUAKU +ACVACed+K/tcYxz5J5nYk3uOHmYFxokVWdrakj17v1rJ6+trdkw3y0x3d3dd172/v5vfPj8/d/8J +Iytq488U0+bw8PDq6oqLh4cHhn1+fta1jXQytWGB+ErLhB5cPz4+xp6z11WWr68vPvf39/WJnT4+ +PmKD17W10dwoYX57c3Oji9vbW2xTXRvpgYVknoQ2fFZfKSkhC6ETFzE7tdDm+PiY6V9eXrh+enri +8/T0NDq6RkxxwZ/EczneSJNuIVgsbVa62rmsVqsfeHN5eanV3aU4W8n5cTqWNobx9ST0G2AbPZzD +c+HcsJk2ht8ACYTiSy8Y7J9eUmG5hQRYbMQUy+SMhDasnUyVyL1NV4rgRCcs1E6btBdDUPiqnTYX +FxeA5q8dyZxYXMJ5W2jDmrqcHHMsTO7GxFDD0OauF/cI1i4bR4yRPylTmixTMOHIvPlMafMjM1xf +X/tgAOqcnZ2JSsaEMcwkVoVab29vJhlVqOozwUQ7n0sHb8br8lqZmsMzPnZivuEYgWc6MKXJ2mev +AeAe9pKwTZrzOkHdo14Yr9ceYvEP24gQDUODmbK2cRY1bYOX8BDZj1kwtpmSQy8+7oXbPnvhTiwc +u9n3UKcut6uK80UVXfb1Qm0wTGDzrNP49ojFlJZGz9EKmEFu4w1Dr3rh/pI0aeJNwIt5lCtfhsZb +LhYvixev/XC9v78fwpdPM2rLtzbALgqJW+2YpRMLXO+cUayDKrWxLWwsFnZl0aWdbYAEcjiuE7Q2 +wLdsJprZb9QDMyXRGKtYT8VytdkFql6H7/eCThR1akVJSp5lkqyt6nAWSyxO2YQFwq+zkF/Cixnj +orU0T4k/EFYiBjwCCOY/6dxZwot5DmN4jskwDW38cY6mcH+WVGT5TbZHbLB0k4mqvZtm6X6z02Tp 
+6gltNptVL6TCfJ5ynUsfjtXCTDhyCS/WyzCGV0cnKhOmcK+dwmJEzovBHRNN+02WFysDuofElt5g +okGe4s7semf9Rh7p8nGpNtOYaJYXi5lrjBh7Uc0wYx0e8GIfb0bU4QvixScnJ2ZQtGGiYZ4ys08a +bCr2i83dSnyiavNxnN80I4QGL06AaRte/O3FQStJfkOWAUIc06jtxSkm6naOF8FE23BQ3zZL3Z/a +Rubixd8xFcNiJ/gybjQkUxTwDrWHA3wydHR0RJ4i0eb3fMWtxmIx7MmdkSAqE+PPz89HYPGwI6ee +VAKLy89ICNZLc3jA0AqxONh4jnUY1XfWFGlt1sHSDtsU2/eLWUEVOiN6FGrN8wbq0GJ/uVuiqSPb +OIuatsEYSr0BBRvdoc0eJSrxm2EETeTFJYzO2SN77micbbY/KeVe2px4ijbcg7XH7s7Pv3c3rUcx +bw4Pz1GoR6GiGpDI7pZV4cWTexRVbDO5R1Gllz6tR1G3XzyhR1GrgpnWo6h74mWz2ciHUAVQb0aN +F8yLcVgdxtstL86wrWl5akj8Rp/3U0CN1SZ7dli4qmSpMWa2Cb2YzMBiDelIYg+GBaLSUDCy1gcH +B8MymTF+qOIfZtmQr3yzezCFvNhkZyleLJQbuwdTyItdoGB+LOebKorFk6lWmhf7NNc8gWIzinpn +h3EdmAmq+AEYPf/rVwVjgzzLi0VLsp2X3xHu++kE1MnyYh9s8jXDltrMy4vn6ZjMtZG+1y1JlqXN +KrBzAOSNO/vrYO7GRymi/eI/pwv5Z3rx36pNEXduUCst63cwJb+7a6RNYU95zqyZ3Wz7n/3uTgDY +tXHhEu7cYqX++t/dbY83vyOr2WnHEu78rwADABaBbeIZChwYAAAAAElFTkSuQmCC'; + +my $CSS_STYLE = ' +html, body, div, span, p, img { + margin: 0; + padding: 0; + border: 0; + outline: 0; + font-size: 100%; + vertical-align: baseline; + background: transparent; +} + +html, body { + font-family: Arial, Verdana; + color: #40454b; + font-size: 12px; + text-align: center; +} + +img { + padding: 0px; margin: 0px; border: none; +} + +.info-panel { + margin-top: 10px; + margin-bottom: 10px; + width: 740px; + text-align: left; +} + +.info-header { + padding-top: 20px; + padding-top: 10px; +} + +.info-header-title { + color: #126499; + text-decoration: none; + font-family: sans-serif; + font-weight: bold; + font-size: 16px; + vertical-align: baseline; + margin-right: 20px; + margin-bottom: 25px; + margin-top: 15px; +} + +.info-content { + padding: 2px; + font-family: "lucida grande",sans-serif,arial; + margin-top: 15px; + margin-bottom: 15px; +} + +.info-table-type { + min-width: 70px; + padding: 4px; + vertical-align: top; +} + +.info-table-value { + font-weight: bold; + padding-top: 4px; + padding-left: 10px; + padding-right: 10px; + 
vertical-align: top; +} + +hr { + background-color: #E0E0E0; + border: medium none; + color: #E0E0E0; + height: 1px; + outline: medium none; +} + +.sequencetext { + font-family: courier, "courier new"; + font-weight: normal; +} +'; + +my $VERSION = '0.6'; +my $WHAT = 'graphs-noPCA'; + +my $man = 0; +my $help = 0; +my %params = ('help' => \$help, 'h' => \$help, 'man' => \$man); +GetOptions( \%params, + 'help|h', + 'man', + 'verbose', + 'version' => sub { print "PRINSEQ-$WHAT $VERSION\n"; exit; }, + 'i=s', + 'o=s', + 'png_all', + 'html_all', + 'log:s', + 'web:s' + ) or pod2usage(2); +pod2usage(1) if $help; +pod2usage(-exitstatus => 0, -verbose => 2) if $man; + +=head1 NAME + +PRINSEQ - PReprocessing and INformation of SEQuence data + +=head1 VERSION + +PRINSEQ-graphs 0.6 + +=head1 SYNOPSIS + +perl prinseq-graphs.pl [-h] [-help] [-version] [-man] [-verbose] [-i input_graph_data_file] [-png_all] [-html_all] [-log file] + +=head1 DESCRIPTION + +PRINSEQ will help you to preprocess your genomic or metagenomic sequence data in FASTA (and QUAL) or FASTQ format. The graphs version allows users of the lite version to generate graphs similar to the web version. + +=head1 OPTIONS + +=over 8 + +=item B<-help> | B<-h> + +Print the help message; ignore other arguments. + +=item B<-man> + +Print the full documentation; ignore other arguments. + +=item B<-version> + +Print program version; ignore other arguments. + +=item B<-verbose> + +Prints status and info messages during processing. + +=item B<***** INPUT OPTIONS *****> + +=item B<-i> + +Input file containing the graph data generated by the lite version. + +=item B<***** OUTPUT OPTIONS *****> + +=item B<-o> + +By default, the output files are created in the same directory as the input file with an additional "_prinseq_graphs_XXXX" in their name (where XXXX is replaced by random characters to prevent overwriting previous files). To change the output filename and location, specify the filename using this option. 
The file extension will be added automatically. + +=item B<-png_all> + +Use this option to generate PNG files with the graphs. + +=item B<-html_all> + +Use this option to generate a HTML file with the graphs and tables. + +=item B<-log> + +Log file to keep track of parameters, errors, etc. The log file name is optional. If no file name is given, the log file name will be "inputname.log". If the log file already exists, new content will be added to the file. + +=back + +=head1 AUTHOR + +Robert SCHMIEDER, C<< >> + +=head1 BUGS + +If you find a bug please email me at C<< >> or use http://sourceforge.net/tracker/?group_id=315449 so that I can make PRINSEQ better. + +=head1 COPYRIGHT + +Copyright (C) 2011-2012 Robert SCHMIEDER + +=head1 LICENSE + +This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. + +You should have received a copy of the GNU General Public License along with this program. If not, see . + +=cut + +# +################################################################################ +## DATA AND PARAMETER CHECKING +################################################################################ +# + +my ($file1,$command,@dataread); + +#Check if input file exists and check if file format is correct +if(exists $params{i}) { + $command .= ' -i '.$params{i}; + $file1 = $params{i}; + if($params{i} eq 'stdin') { + my $format = &checkInputFormat(); + unless($format eq 'gd') { + &printError('input data for -i is in '.uc($format).' 
format not in graph data format'); + } + } elsif(-e $params{i}) { + #check for file format + my $format = &checkFileFormat($file1); + unless($format eq 'gd') { + &printError('input file for -i is in '.uc($format).' format not in graph data format'); + } + } else { + &printError("could not find input file \"".$params{i}."\""); + } +} else { + &printError("you did not specify an input file containing the graph data"); +} + +#check output file name prefix +if(exists $params{o}) { + $command .= ' -o '.$params{o}; +} + +#check for output format +unless(exists $params{png_all} || exists $params{html_all}) { + &printError("No output format specified. Use -png_all and/or -html_all to generate graphs."); +} +if(exists $params{png_all}) { + $command .= ' -png_all'; +} +if(exists $params{html_all}) { + $command .= ' -html_all'; +} +if(exists $params{web}) { + $command .= ' -web'.($params{web} ? ' '.$params{web} : ''); +} + +#add remaining to log command +if(exists $params{log}) { + $command .= ' -log'.($params{log} ? ' '.$params{log} : ''); + + unless($params{log}) { + $params{log} = join("__",$file1||'nonamegiven').'.log'; + } + $params{log} = cwd().'/'.$params{log} unless($params{log} =~ /^\//); + &printLog("Executing PRINSEQ with command: \"perl prinseq-".$WHAT.".pl".$command."\""); +} + +# +################################################################################ +## DATA PROCESSING +################################################################################ +# + +my $filename = $file1; +while($filename =~ /[\w\d]+\.[\w\d]+$/) { + $filename =~ s/\.[\w\d]+$//; + last if($filename =~ /\/[^\.]+$/); +} + +if(exists $params{png_all}) { + my $graphs = &generateGraphs($params{i},$params{o}); + if(exists $params{web} && $params{web} ne 'nozip') { + #png files + if(scalar(@$graphs)) { + system("zip -j -r ".dirname($params{o})."/png_graphs.zip ".dirname($params{o}).' 
-i \*.png') == 0 or &printError("Cannot generate graphs ZIP file"); + } + } +} +if(exists $params{html_all}) { + &generateHtml($params{i},$params{o}); +} + +&printWeb("STATUS: done"); + +## +################################################################################# +### MISC FUNCTIONS +################################################################################# +## + +sub printError { + my $msg = shift; + print STDERR "\nERROR: ".$msg.".\n\nTry \'perl prinseq-".$WHAT.".pl -h\' for more information.\nExit program.\n"; + &printLog("ERROR: ".$msg.". Exit program.\n"); + exit(0); +} + +sub printWarning { + my $msg = shift; + print STDERR "WARNING: ".$msg.".\n"; + &printLog("WARNING: ".$msg.".\n"); +} + +sub printWeb { + my $msg = shift; + if(exists $params{web}) { + print STDERR "\n".&getTime()."$msg\n"; + } +} + +sub getTime { + return sprintf("[%02d/%02d/%04d %02d:%02d:%02d] ",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); +} + +sub printLog { + my $msg = shift; + if(exists $params{log}) { + my $time = sprintf("%02d/%02d/%04d %02d:%02d:%02d",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); + open(FH, ">>", $params{log}) or die "ERROR: Can't open file ".$params{log}.": $! \n"; + flock(FH, LOCK_EX) or die "ERROR: Cannot lock file ".$params{log}.": $! \n"; + print FH "[prinseq-".$WHAT."-$VERSION] [$time] $msg\n"; + flock(FH, LOCK_UN) or die "ERROR: cannot unlock ".$params{log}.": $! \n"; + close(FH); + } +} + +sub addCommas { + my $num = shift; + return unless(defined $num); + return $num if($num < 1000); + $num = scalar reverse $num; + $num =~ s/(\d{3})/$1\,/g; + $num =~ s/\,$//; + $num = scalar reverse $num; + return $num; +} + +sub checkFileFormat { + my $file = shift; + + my ($format,$count,$id,$fasta,$fastq,$qual,$gd,$aa); + $count = 3; + $fasta = $fastq = $qual = $gd = $aa = 0; + $format = 'unknown'; + + open(FILE,"perl -p -e 's/\r/\n/g;s/\n\n/\n/g' < $file |") or die "ERROR: Could not open file $file: $! 
\n"; + while (<FILE>) { +# chomp(); + # next unless(length($_)); + if($count-- == 0) { + last; + } elsif(!$fasta && /^\>\S+\s*/) { + $fasta = 1; + $qual = 1; + } elsif($fasta == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*-]+/) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/))) { + $fasta = 2; + } elsif($qual == 1 && /^\s*\d+/) { + $qual = 2; + } elsif(!$fastq && /^\@(\S+)\s*/) { + $id = $1; + $fastq = 1; + } elsif($fastq == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*-]+/) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/))) { + $fastq = 2; + } elsif($fastq == 2 && /^\+(\S*)\s*/) { + $fastq = 3 if($id eq $1 || /^\+\s*$/); + } elsif(!$gd && /^\{\"numseqs\"\:/) { + $gd = 1; + } + } + close(FILE); + if($fasta == 2) { + $format = 'fasta'; + } elsif($qual == 2) { + $format = 'qual'; + } elsif($fastq == 3) { + $format = 'fastq'; + } elsif($gd == 1) { + $format = 'gd'; + } + + return $format; +} + +sub checkInputFormat { + my ($format,$count,$id,$fasta,$fastq,$qual,$gd,$aa); + $count = 3; + $fasta = $fastq = $qual = $gd = $aa = 0; + $format = 'unknown'; + + while (<STDIN>) { + push(@dataread,$_); +# chomp(); + # next unless(length($_)); + if($count-- == 0) { + last; + } elsif(!$fasta && /^\>\S+\s*/) { + $fasta = 1; + $qual = 1; + } elsif($fasta == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*-]+/) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/))) { + $fasta = 2; + } elsif($qual == 1 && /^\s*\d+/) { + $qual = 2; + } elsif(!$fastq && /^\@(\S+)\s*/) { + $id = $1; + $fastq = 1; + } elsif($fastq == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*-]+/) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/))) { + $fastq = 2; + } elsif($fastq == 2 && /^\+(\S*)\s*/) { + $fastq = 3 if($id eq $1 || /^\+\s*$/); + } elsif(!$gd && /^\{\"numseqs\"\:/) { + $gd = 1; + } + } + + if($fasta == 2) { + $format = 'fasta'; + } elsif($qual == 2) { + $format = 'qual'; + } elsif($fastq == 3) { + $format 
= 'fastq'; + } elsif($gd == 1) { + $format = 'gd'; + } + + return $format; +} + +sub readGdFile { + my $file = shift; + my $data; + + open(DATA,"<$file") or &printError("Could not open file $file: $!"); + while(<DATA>) { + next if(/^\#/); + chomp(); + if(length($_)) { + $data = from_json($_); + } + } + close(DATA); + + return $data; +} + +sub getFileName { + my $ext = shift; + my ($file,$fh); + if(exists $params{o}) { + $file = $params{o}.$ext; + open(OUT,">$file") or &printError('cannot open output file'); + close(OUT); + } else { + $fh = File::Temp->new( TEMPLATE => $filename.'_prinseq_graphs_XXXX', + SUFFIX => $ext, + UNLINK => 0); + $file = $fh->filename; + $fh->close(); + } + return $file; +} + +sub generateGraphs { + my ($in,$out) = @_; + my ($file,$data,$surface,@graphs); + $data = &readGdFile($in); + + #length plot + if(exists $data->{counts}->{length}) { + $file = &getFileName('_ld.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{length},1),$data->{stats}->{length},'Length Distribution','Read Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{counts2} && exists $data->{counts2}->{length}) { + $file = &getFileName('_ld-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{length},1),$data->{stats2}->{length},'Length Distribution','Read Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #tail plot + if(exists $data->{tail}) { + $file = &getFileName('_td5.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{tail5},1),undef,'Poly-A/T Tail Distribution (> 4bp)','5\' Tail Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + $file = &getFileName('_td3.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{tail3},1),undef,'Poly-A/T Tail Distribution (> 4bp)','3\' Tail Length in bp','# 
Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{tail2}) { + $file = &getFileName('_td5-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{tail5},1),undef,'Poly-A/T Tail Distribution (> 4bp)','5\' Tail Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + $file = &getFileName('_td3-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{tail3},1),undef,'Poly-A/T Tail Distribution (> 4bp)','3\' Tail Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Ns plot + if(exists $data->{counts}->{ns}) { + $file = &getFileName('_ns.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{ns},1),undef,'Percentage of N\'s (> 0%)','Percentage of N\'s per Read (1-100%)','# Sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{counts2} && exists $data->{counts2}->{ns}) { + $file = &getFileName('_ns-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{ns},1),undef,'Percentage of N\'s (> 0%)','Percentage of N\'s per Read (1-100%)','# Sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #GC content plot + if(exists $data->{counts}->{gc}) { + $file = &getFileName('_gc.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{gc},0),$data->{stats}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences',$file,1); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{counts2} && exists $data->{counts2}->{gc}) { + $file = &getFileName('_gc-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{gc},0),$data->{stats2}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences',$file,1); + $surface->write_to_png($file); + 
push(@graphs,$file); + } + + #Sequence complexity plot - dust + if(exists $data->{compldust}) { + $file = &getFileName('_cd.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{compldust},0),undef,'Sequence complexity distribution','Mean sequence complexity (DUST scores)','Number of sequences',$file,1); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Sequence complexity plot - entropy + if(exists $data->{complentropy}) { + $file = &getFileName('_ce.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{complentropy},0),undef,'Sequence complexity distribution','Mean sequence complexity (Entropy values)','Number of sequences',$file,1); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Dinucleotide odds ratio PCA plot - microbial/viral + #Odds ratio plot + if(exists $data->{dinucodds}) { + my @new = map {$data->{dinucodds}->{$_}} sort keys %{$data->{dinucodds}}; +# $file = &getFileName('_pm.png'); +# $surface = &createPCAPlot(&convertToPCAValues(\@new,'m'),'PCA','1st Principal Component Score','2nd Principal Component Score',$file); +# $surface->write_to_png($file); +# push(@graphs,$file); +# $file = &getFileName('_pv.png'); +# $surface = &createPCAPlot(&convertToPCAValues(\@new,'v'),'PCA','1st Principal Component Score','2nd Principal Component Score',$file); +# $surface->write_to_png($file); +# push(@graphs,$file); + $file = &getFileName('_or.png'); + $surface = &createOddsRatioPlot($data->{dinucodds},'Odds ratios','Dinucleotide','Odds ratio',$file); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Qual plot + if(exists $data->{quals}) { + $file = &getFileName('_qd.png'); + $surface = &createBoxPlot(&convertToBoxValues($data->{quals},4),'Base Quality Distribution','Read position in %','Quality score',$file); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{quals2}) { + $file = &getFileName('_qd-2.png'); + $surface = 
&createBoxPlot(&convertToBoxValues($data->{quals2},4),'Base Quality Distribution','Read position in %','Quality score',$file); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Qualbin plot + if(exists $data->{qualsbin}) { + $file = &getFileName('_qd2.png'); + $surface = &createBoxPlot(&convertToBoxValues($data->{qualsbin},4),'Base Quality Distribution','Read position in bp','Quality score',$file,0,'bp',$data->{binval}); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{qualsbin2}) { + $file = &getFileName('_qd2-2.png'); + $surface = &createBoxPlot(&convertToBoxValues($data->{qualsbin2},4),'Base Quality Distribution','Read position in bp','Quality score',$file,0,'bp',$data->{binval}); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Qualmean plot + if(exists $data->{qualsmean}) { + $file = &getFileName('_qd3.png'); + $surface = &createBarPlot(&convertToBarValues($data->{qualsmean},5,1),'Sequence Quality Distribution','Mean of quality scores per sequence','Number of sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{qualsmean2}) { + $file = &getFileName('_qd3-2.png'); + $surface = &createBarPlot(&convertToBarValues($data->{qualsmean2},5,1),'Sequence Quality Distribution','Mean of quality scores per sequence','Number of sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Sequence duplicate plots + if(exists $data->{dubscounts}) { + $file = &getFileName('_df.png'); + $surface = &createStackBarPlot(&convertOdToStackBinMatrix($data->{dubscounts},5,1,100),'Sequence duplication level','Number of duplicates','Number of sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{dubslength}) { + $file = &getFileName('_dl.png'); + $surface = &createStackBarPlot(&convertOdToStackBinMatrix($data->{dubslength},5,1),'Sequence duplication level','Read Length in bp','Number of duplicates',$file,0,' 
bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{dubscounts}) { + my %dubsmax; + my $count = 1; + foreach my $n (sort {$b <=> $a} keys %{$data->{dubscounts}}) { + foreach my $s (keys %{$data->{dubscounts}->{$n}}) { + foreach my $i (1..$data->{dubscounts}->{$n}->{$s}) { + $dubsmax{$count++}->{$s} = $n; + last unless($count <= 100); + } + last unless($count <= 100); + } + last unless($count <= 100); + } + $file = &getFileName('_dm.png'); + $surface = &createStackBarPlot(&convertOdToStackBinMatrix(\%dubsmax,5,1,100),'Sequence duplication level','Sequence','Number of duplicates',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + + return \@graphs; +} + +sub convertOdToBinMatrix { + my ($data,$min,$max,$nonice) = @_; + + my ($num,$ymax,$xmax,$xmin,$step,%vals,$tmp,@matrix,$bin,$tmpbin); + + #make nice xmax value + if(defined $max) { + $xmax = $max; + } else { + $xmax = (sort {$b <=> $a} keys %$data)[0]; + } + $bin = &getBinVal($xmax); + $xmax = $bin*100; + $xmin = (defined $min ? $min : 0); + + #get data to bin and find y axis max value + $ymax = 0; + $tmp = 0; + $tmpbin = $bin; + foreach my $i ($xmin..$xmax) { + if(exists $data->{$i}) { + $tmp += $data->{$i}; + } + if(--$tmpbin <= 0) { + $tmpbin = $bin; + $ymax = &max($ymax,$tmp); + push(@matrix,$tmp); + $tmp = 0; + } + } + + #make nice ymax value + unless($nonice) { + $ymax = sprintf("%d",($ymax/4)+1)*4 if($ymax % 4); +# $step = ($ymax <= 10 ? 10 : ($ymax < 40 ? 40 : ($ymax < 100 ? 100 : ($ymax < 1000 ? 100 : 100)))); +# $ymax = sprintf("%d",($ymax/$step)+1)*$step if($ymax % $step); + } + + return (\@matrix,$xmax,$ymax); +} + +sub getBinVal { + my $val = shift; + my $step; + if(!$val || $val <= 100) { + return 1; + } elsif($val < 10000) { + return int($val/100)+($val % 100 ? 1 : 0); + } elsif($val < 100000) { + return 1000; + } else { + $step = 1000000; + my $xmax = ($val % $step ? 
sprintf("%d",($val/$step+1))*$step : $val); + return ($xmax/100); + } +} + +sub max { + my ($a,$b) = @_; + return ($a < $b ? $b : $a); +} + +sub min { + my ($a,$b) = @_; + return ($a > $b ? $b : $a); +} + +sub createAnnotBarPlot { + my ($matrix,$xmax,$ymax,$annot,$title,$xlab,$ylab,$file,$zero,$add) = @_; + + my $bin = 1; + if($xmax > 100) { + $bin = $xmax / 100; + $xmax = 100; + } + + my @barcol = (127/255, 127/255, 255/255, 1); #7f7fff + my @meancol = (255/255, 127/255, 127/255, 1); #ff7f7f + my @stdcol = (178/255, 178/255, 255/255, 0.8); #b2b2ff + my @std1col = (0, 0, 0, 0.04); #black, 4% alpha + my @std2col = (0, 0, 0, 0.03); #black, 3% alpha + my @linecol = (0, 0, 0, 0.4); + my @helplinecol = (1, 1, 1, 0.9); + my @background = (0.95, 0.95, 0.95, 1); + my @tickcol = (0, 0, 0, 0.8); + my @labelcol = (0, 0, 0, 1); + + #create new image + my $size = 6; + my $offset = 20; + my $left = 40; + my $bottom = 15; + my $top = 20; + my $height = 200; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+($xmax+$zero)*$size,$bottom+$top+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+2*200+20); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); + + $cr->save; + + #set up work space + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, ($xmax+$zero)*$size-1, $height); + $cr->set_source_rgba(@background); + $cr->fill; + + #draw ticks + #x-axis + $cr->set_source_rgba(@tickcol); + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%5) == 0 && $i > 1 && $i < $xmax) { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 
0 : $size)-1, $top+$offset+$height+3); + } else { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+1); + } + } + $cr->stroke; + + #y-axis + $cr->move_to($left+$offset, $top+$offset); + $cr->line_to($left+$offset-3, $top+$offset); + $cr->move_to($left+$offset, $top+$offset+$height-1); + $cr->line_to($left+$offset-3, $top+$offset+$height-1); + $cr->stroke; + + #helplines + $cr->set_source_rgba(@helplinecol); + foreach my $j (1..3) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 1 : 0)); + } + $cr->stroke; + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(@tickcol); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%10) == 0 && $i > 1 && $i < $xmax) { + $extents = $cr->text_extents($i*$bin); + $cr->move_to($left+$offset+int($size/2+1)+$size*$i-($zero ? 0 : $size)-$extents->{width}/2-1-($i == 1 ? 1 : 0), $top+$offset+$height+$fontheight+2); + $cr->show_text($i*$bin); + } + } + #y-axis + $extents = $cr->text_extents(&addCommas($ymax)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2); + $cr->show_text(&addCommas($ymax)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2+$height); + $cr->show_text(0); + + $cr->save; + + #labels + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #axis labels + $cr->set_source_rgba(@labelcol); + $extents = $cr->text_extents($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? 
$add : '').')' : '')); + $cr->move_to($left+$offset+($xmax+$zero)*$size/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? $add : '').')' : '')); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab.($bin>1 ? ' (per bin)' : '')); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2),$offset+10); + $cr->show_text($ylab.($bin>1 ? ' (per bin)' : '')); + + $cr->restore; + + #draw annotations + if($annot) { + $cr->set_antialias('none'); + my ($std1l,$std2l,$std1r,$std2r); + #std boxes + $std1l = int($annot->{mean})-int($annot->{std}); + $std2l = int($annot->{mean})-2*int($annot->{std}); + $std1r = int($annot->{mean})+int($annot->{std}); + $std2r = int($annot->{mean})+2*int($annot->{std}); + unless($std1l == $std1r) { + if($std1l < 0) { + $std1l = 0; + } else { + $std1l = int($std1l/$bin); + } + if($std2l < 0) { + $std2l = 0; + } else { + $std2l = int($std2l/$bin); + } + if($std1r/$bin > 100) { + $std1r = 100; + } else { + $std1r = int($std1r/$bin); + } + if($std2r/$bin > 100) { + $std2r = 100; + } else { + $std2r = int($std2r/$bin); + } + $cr->rectangle($left+$offset+$std2l*$size+2, $top+$offset, ($std2r-$std2l)*$size, $height); + $cr->set_source_rgba(@std2col); + $cr->fill; + $cr->rectangle($left+$offset+$std1l*$size+2, $top+$offset, ($std1r-$std1l)*$size, $height); + $cr->set_source_rgba(@std1col); + $cr->fill; + #mean line + $cr->set_source_rgba(@meancol); + $cr->move_to($left+$offset+int(int($annot->{mean})/$bin)*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+int(int($annot->{mean})/$bin)*$size+2, $top+$offset+$height); + $cr->stroke; + #std lines + $cr->set_source_rgba(@stdcol); + if($std1l > 0) { + $cr->move_to($left+$offset+$std1l*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+$std1l*$size+2, $top+$offset+$height); + } + if($std2l > 0) { + $cr->move_to($left+$offset+$std2l*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+$std2l*$size+2, 
$top+$offset+$height); + } + if($std1r < 100) { + $cr->move_to($left+$offset+$std1r*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+$std1r*$size+2, $top+$offset+$height); + } + if($std2r < 100) { + $cr->move_to($left+$offset+$std2r*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+$std2r*$size+2, $top+$offset+$height); + } + $cr->stroke; + #labels + $cr->set_antialias('default'); + $cr->set_source_rgba(@tickcol); + $extents = $cr->text_extents('M'); + $cr->move_to($left+$offset+int(int($annot->{mean})/$bin)*$size+2-$extents->{width}/2, $top+$offset-10); + $cr->show_text('M'); + if($std1l > 0) { + $extents = $cr->text_extents('1SD'); + $cr->move_to($left+$offset+$std1l*$size-$extents->{width}/2+2, $top+$offset-10); + $cr->show_text('1SD'); + } + if($std2l > 0) { + $extents = $cr->text_extents('2SD'); + $cr->move_to($left+$offset+$std2l*$size-$extents->{width}/2+3, $top+$offset-10); + $cr->show_text('2SD'); + } + if($std1r < 100) { + $extents = $cr->text_extents('1SD'); + $cr->move_to($left+$offset+$std1r*$size-$extents->{width}/2+2, $top+$offset-10); + $cr->show_text('1SD'); + } + if($std2r < 100) { + $extents = $cr->text_extents('2SD'); + $cr->move_to($left+$offset+$std2r*$size-$extents->{width}/2+3, $top+$offset-10); + $cr->show_text('2SD'); + } + } + } + + #draw boxes + $cr->set_antialias('none'); + $cr->set_source_rgba(@barcol); + foreach my $pos (0..$xmax-($zero ? 0 : 1)) { + next unless($matrix->[$pos]); + my $tmp = $matrix->[$pos] / $ymax; + #unique + if($tmp) { + $cr->rectangle($left+$offset+$pos*$size, $top+$offset+$height, $size-1, -$tmp*$height); + $cr->fill; + } + } + + #write image + $cr->show_page; + return $surface; +} + +#sub convertToPCAValues { +# my ($new,$type) = @_; +# +# my @data = ($type eq 'v' ? 
@$DINUCODDS_VIR : @$DINUCODDS_MIC); +# +# push(@data,$new); +# +# my $pca = Statistics::PCA->new; +# $pca->load_data({format => 'table', data => \@data}); +# $pca->pca(); +# +# my @variances = $pca->results('proportion'); +# my @list = $pca->results('transformed'); +# +# my ($xmin,$xmax,$ymin,$ymax); +# $xmax = $ymax = -100; +# $xmin = $ymin = 100; +# +# #get min/max values for PC1 +# foreach my $v (@{$list[0]}) { +# $xmax = &max($xmax,$v); +# $xmin = &min($xmin,$v); +# } +# #get min/max values for PC2 +# foreach my $v (@{$list[1]}) { +# $ymax = &max($ymax,$v); +# $ymin = &min($ymin,$v); +# } +# +# return ([$list[0],$list[1]],sprintf("%d",$variances[0]*100),sprintf("%d",$variances[1]*100),$xmin,$xmax,$ymin,$ymax,$type); +#} + +sub createPCAPlot { + my ($data,$var1,$var2,$xmin,$xmax,$ymin,$ymax,$type,$title,$xlab,$ylab,$file) = @_; + + my @linecol = (0, 0, 0, 0.4); + my @helplinecol1 = (1,1,1, 0.9); + my @helplinecol2 = (1,1,1, 0.5); + + #create new image + my $size = 5; + my $offset = 20; + my $left = 25; + my $bottom = 15; + my $top = ($type eq 'v' ? 
35 : 20); + my $height = 500; + my $space = 10; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+$height+2*$space,$top+$bottom+$offset*2+$height+2*$space); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+$height+2*$space,$top+$bottom+$offset*2+$height+2*$space); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans-serif', 'normal', 'normal'); + + $cr->save; + + #set up work space + my ($dx, $dy); + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, $height+2*$space, $height+2*$space); + $cr->set_source_rgba(0.95, 0.95, 0.95, 1); + $cr->fill; + + #get infos + my $num = scalar(@{$data->[0]})-1; + my $xrange = ($xmax-$xmin); + my $yrange = ($ymax-$ymin); + my $data_info = ($type eq 'v' ? $DATA_VIR : $DATA_MIC); + + #draw ticks + #x-axis + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->move_to($left+$offset+$space, $top+$offset+$height+2*$space); + $cr->line_to($left+$offset+$space, $top+$offset+$height+2*$space+3); + $cr->move_to($left+$offset+$space+$height, $top+$offset+$height+2*$space); + $cr->line_to($left+$offset+$space+$height, $top+$offset+$height+2*$space+3); + $cr->move_to($left+$offset+$space+int(abs($xmin)/$xrange*$height), $top+$offset+$height+2*$space); + $cr->line_to($left+$offset+$space+int(abs($xmin)/$xrange*$height), $top+$offset+$height+2*$space+3); + $cr->stroke; + #y-axis + $cr->move_to($left+$offset, $top+$offset+$space); + $cr->line_to($left+$offset-3, $top+$offset+$space); + $cr->move_to($left+$offset, $top+$offset+$height+$space); + $cr->line_to($left+$offset-3, $top+$offset+$height+$space); + $cr->move_to($left+$offset, $top+$offset+$space+int(abs($ymax)/$yrange*$height)); + $cr->line_to($left+$offset-3, $top+$offset+$space+int(abs($ymax)/$yrange*$height)); + $cr->stroke; + + 
#helplines + $cr->set_source_rgba(@helplinecol1); + $cr->move_to($left+$offset+$space+int(abs($xmin)/$xrange*$height), $top+$offset); + $cr->line_to($left+$offset+$space+int(abs($xmin)/$xrange*$height), $top+$offset+$height+2*$space); + $cr->stroke; + $cr->move_to($left+$offset, $top+$offset+$space+int(abs($ymax)/$yrange*$height)); + $cr->line_to($left+$offset+2*$space+$height, $top+$offset+$space+int(abs($ymax)/$yrange*$height)); + $cr->stroke; + + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + $extents = $cr->text_extents(sprintf("%.2f",$xmin)); + $cr->move_to($left+$offset+$space-$extents->{width}/2-1, $top+$offset+$height+2*$space+$fontheight+2); + $cr->show_text(sprintf("%.2f",$xmin)); + $extents = $cr->text_extents(sprintf("%.2f",$xmax)); + $cr->move_to($left+$offset+$space+$height-$extents->{width}/2-1, $top+$offset+$height+2*$space+$fontheight+2); + $cr->show_text(sprintf("%.2f",$xmax)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset+$space+int(abs($xmin)/$xrange*$height)-$extents->{width}/2, $top+$offset+$height+2*$space+$fontheight+2); + $cr->show_text(0); + #y-axis + $extents = $cr->text_extents(sprintf("%.2f",$ymax)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$space+$fontheight/2-2); + $cr->show_text(sprintf("%.2f",$ymax)); + $extents = $cr->text_extents(sprintf("%.2f",$ymin)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$height+$space+$fontheight/2-2); + $cr->show_text(sprintf("%.2f",$ymin)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$space+int(abs($ymax)/$yrange*$height)+$fontheight/2-2); + $cr->show_text(0); + + $cr->save; + + #labels + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #add type + $cr->set_source_rgba(0, 0, 0, 
0.5); + $extents = $cr->text_extents(uc($type)); + $cr->arc($offset/2+$extents->{width}/2, $offset-5, 10, 0, 2*$PI); + $cr->fill; + $cr->set_source_rgba(1, 1, 1, 1); + $cr->move_to($offset/2-($type eq 'm' ? 1 : 0), $offset); + $cr->show_text(uc($type)); + + #axis labels + $cr->set_source_rgba(0, 0, 0, 1); + $extents = $cr->text_extents($xlab.' ('.$var1.'%)'); + $cr->move_to($left+$offset+$height/2-$extents->{width}/2+$space, $top+$offset+$height+$fontheight+15+2*$space); + $cr->show_text($xlab.' ('.$var1.'%)'); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab.' ('.$var2.'%)'); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2)+$space,$offset); + $cr->show_text($ylab.' ('.$var2.'%)'); + + $cr->restore; + + #draw dots + $cr->set_antialias('default'); + $cr->set_font_size (10); + foreach my $i (0..$num) { + $cr->set_source_rgba(@{$data_info->[$i]->[3]}); + $cr->arc(($left+$offset+$space+int(($data->[0]->[$i]+abs($xmin))/$xrange*$height)), ($space+$top+$offset+int(($data->[1]->[$i]+abs($ymin))/$yrange*$height)), $size, 0, 2*$PI); + $cr->fill; + } + $cr->set_source_rgba(0, 0, 0, 1); + foreach my $i (0..$num) { + $extents = $cr->text_extents($data_info->[$i]->[1]); + $cr->move_to(($left+$offset+$space+int(($data->[0]->[$i]+abs($xmin))/$xrange*$height))+$size+1, ($space+$top+$offset+int(($data->[1]->[$i]+abs($ymin))/$yrange*$height))+$size*2); + $cr->show_text($data_info->[$i]->[1]); + } + + #draw legend + my %labels; + foreach my $i (0..$num) { + $labels{$data_info->[$i]->[1]} = $data_info->[$i]->[2]; + } + $cr->set_font_size(10); + $fontheight = $font_extents->{height}; + $cr->set_source_rgba(0, 0, 0, 1); + my $x = $left+$offset+$space; + my $y = int($offset/2); + foreach my $n (sort {$a <=> $b} keys %labels) { + if($x+$cr->text_extents($n.' - '.$labels{$n})->{width}+15 >= $left+$offset+$space+$height) { + $x = $left+$offset+$space; + $y += $fontheight; + } + $cr->move_to($x,$y); + $cr->show_text($n.' 
- '.$labels{$n}); + $x += $cr->text_extents($n.' - '.$labels{$n})->{width}+15; + + } + + #write image + $cr->show_page; + return $surface; +} + +sub createOddsRatioPlot { + my ($data,$title,$xlab,$ylab,$file) = @_; + + my @yvalues = (0.5,0.78,1.00,1.23,1.5); + + my @linecol = (0, 0, 0, 0.4); + my @helplinecol1 = (1,1,1, 0.9); + my @helplinecol2 = (1,1,1, 0.5); + + #create new image + my $size = 40; + my $offset = 20; + my $left = 35; + my $right = 90; + my $bottom = 20; + my $top = 0; + my $height = 100; + my $width = $size*10; + my $space = 20; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+$width+$right,$top+$bottom+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+$width+$right,$top+$bottom+$offset*2+$height); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); + + $cr->save; + + #set up work space + my ($dx, $dy); + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, $width, $height); + $cr->set_source_rgba(0.95, 0.95, 0.95, 1); + $cr->fill; + + #right side marks + $cr->set_source_rgba(255/255, 127/255, 127/255, 0.6); + $cr->rectangle($left+$offset+$width+8, $top+$offset, 3, 0.77/2*$height); + $cr->fill; + $cr->rectangle($left+$offset+$width+8, $top+$offset+$height-0.78/2*$height, 3, 0.78/2*$height); + $cr->fill; + + #get infos + my $num = scalar(keys %$data)-1; + + #draw ticks + #x-axis + $cr->set_source_rgba(0, 0, 0, 0.8); + foreach my $i (0..$num) { + $cr->move_to($left+$offset+$size/2+$i*$size, $top+$offset+$height); + $cr->line_to($left+$offset+$size/2+$i*$size, $top+$offset+$height+3); + } + $cr->stroke; + #y-axis + foreach my $i (@yvalues) { + $cr->move_to($left+$offset, $top+$offset+$height-$i/2*$height); + $cr->line_to($left+$offset-3, 
$top+$offset+$height-$i/2*$height); + } + $cr->stroke; + + #helplines + #x-axis + $cr->set_source_rgba(@helplinecol1); + foreach my $i (0..$num) { + $cr->move_to($left+$offset+$size/2+$i*$size, $top+$offset); + $cr->line_to($left+$offset+$size/2+$i*$size, $top+$offset+$height); + } + $cr->stroke; + #yaxis + foreach my $i (@yvalues) { + $cr->set_source_rgba(0, 0, 0, ($i == 0.5 || $i == 1.00 || $i == 1.50 ? 0.1 : 0.3)); + $cr->move_to($left+$offset, $top+$offset+$height-$i/2*$height); + $cr->line_to($left+$offset+$width, $top+$offset+$height-$i/2*$height); + $cr->stroke; + } + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + my $xcur = 0; + foreach my $dn (map {join("/",(m/../g ))} sort keys %$data) { + $extents = $cr->text_extents($dn); + $cr->move_to($left+$offset+$size/2-$extents->{width}/2-1+$size*$xcur++, $top+$offset+$height+$fontheight+2); + $cr->show_text($dn); + } + #y-axis + foreach my $i (@yvalues) { + $cr->set_source_rgba(0, 0, 0, ($i == 0.5 || $i == 1.00 || $i == 1.50 ? 
0.5 : 0.8)); + $extents = $cr->text_extents(sprintf("%.2f",$i)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$height-$i/2*$height+$fontheight/2-2); + $cr->show_text(sprintf("%.2f",$i)); + } + + #label on right side + $cr->set_source_rgba(0, 0, 0, 1); + $extents = $cr->text_extents('Over-represented'); + $cr->move_to($left+$offset+$width+15, $top+$offset+$height-1.6/2*$height+$fontheight/2-2); + $cr->show_text('Over-represented'); + $extents = $cr->text_extents('Under-represented'); + $cr->move_to($left+$offset+$width+15, $top+$offset+$height-0.4/2*$height+$fontheight/2-2); + $cr->show_text('Under-represented'); + + $cr->save; + + #labels + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #axis labels + #x-axis + $cr->set_source_rgba(0, 0, 0, 1); + $extents = $cr->text_extents($xlab); + $cr->move_to($left+$offset+$width/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab); + #y-axis + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2),$offset); + $cr->show_text($ylab); + + $cr->restore; + + #draw dots + $cr->set_antialias('default'); + $xcur = 0; + foreach my $dn (sort keys %$data) { + if($data->{$dn} > 1.23 || $data->{$dn} < 0.78) { + $cr->set_source_rgba(255/255, 127/255, 127/255, 1); + } else { + $cr->set_source_rgba(127/255, 127/255, 255/255, 1); + } + $cr->arc($left+$offset+$size/2+$size*$xcur++, $top+$offset+$height-$data->{$dn}/2*$height, 5, 0, 2*$PI); + $cr->fill; + } + + #write image + $cr->show_page; + return $surface; +} + +sub convertToBoxValues { + my ($data,$niceval) = @_; + my ($xmax,$ymax,@matrix); + $xmax = $ymax = 0; + foreach my $i (sort {$a <=> $b} keys %$data) { + $xmax++; + push(@matrix,[$i,$data->{$i}->{min},$data->{$i}->{p25},$data->{$i}->{median},$data->{$i}->{p75},$data->{$i}->{max}]); + $ymax = &max($ymax,$data->{$i}->{max}); + } + + if($niceval) { + $ymax = 
sprintf("%d",($ymax/$niceval)+1)*$niceval if($ymax % $niceval); + } + + return (\@matrix,$xmax,$ymax); +} + +sub createBoxPlot { + my ($matrix,$xmax,$ymax,$title,$xlab,$ylab,$file,$zero,$add,$bin) = @_; + $bin = ($bin ? $bin : 1); + $zero = 0 unless($zero); + $add = '' unless(defined $add); + if($xmax != 100) { + $xmax = 100; + } + $ymax = 1 unless($ymax); +# die Dumper $matrix; + + + my @col0 = (178/255, 178/255, 255/255); #b2b2ff + my @col1 = (255/255, 178/255, 178/255); #ffb2b2 + my @col3 = (127/255, 127/255, 255/255); #7f7fff + my @col4 = (255/255, 127/255, 127/255); #ff7f7f + my @linecol = (0, 0, 0, 0.4); + my @linecol0 = (@col3, 1); + my @linecol1 = (@col4, 1); + my @boxcol = (@col3, 1); + my @whiscol = (@col0, 0.9); + my @medcol = (0,0,0, 0.5); + my @helplinecol1 = (1,1,1, 0.9); + my @helplinecol2 = (1,1,1, 0.5); + + #create new image + my $size = 6; + my $offset = 20; + my $left = 25; + my $bottom = 25; + my $top = 5; + my $height = 300; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+2*200+20); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); +# $cr->set_font_size (30); + + $cr->save; + + #set up work space + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, ($xmax+$zero)*$size-1, $height); + $cr->set_source_rgba(0.95, 0.95, 0.95, 1); + $cr->fill; + + #draw legend + $cr->set_font_size(10); +# $font_extents = $cr->font_extents; + my $x = $left+$offset+$size*50; + foreach my $v ([\@whiscol,'Min/Max value'],[\@boxcol,'25th to 75th percentile'],[\@medcol,'Median']) { + $cr->set_antialias('none'); + $cr->set_source_rgba(@{$v->[0]}); + 
$cr->rectangle($x, $top+5, 10, 10); + $cr->fill; + $x += 15; + $cr->set_antialias('default'); + $cr->move_to($x,$top+5+9); + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->show_text($v->[1]); + $x += $cr->text_extents($v->[1])->{width}+15; + } + + $cr->set_antialias('none'); + + #draw ticks + #x-axis + $cr->set_source_rgba(0, 0, 0, 0.8); +# $cr->move_to($left+$offset+int($size/2+1), $top+$offset+$height); +# $cr->line_to($left+$offset+int($size/2+1), $top+$offset+$height+3); +# $cr->move_to($left+$offset+int($size/2+1), $top+$offset+$height+$space); +# $cr->line_to($left+$offset+int($size/2+1), $top+$offset+$height+$space-3); + foreach my $i (1..9) { + $cr->move_to($left+$offset+int($size/2)+$size*10*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*10*$i-($zero ? 0 : $size)-1, $top+$offset+$height+3); +# $cr->move_to($left+$offset+int($size/2)+$size*10*$i-($zero ? 0 : $size)-1, $top+$offset); +# $cr->line_to($left+$offset+int($size/2)+$size*10*$i-($zero ? 0 : $size)-1, $top+$offset-3); + } + $cr->stroke; + #y-axis + foreach my $j (0..4) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset-3, $top+$offset+$height*$j/4-($j ? 1 : 0)); +# $cr->move_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 1 : 0)); +# $cr->line_to($left+$offset+($xmax+$zero)*$size+3, $top+$offset+$height*$j/4-($j ? 1 : 0)); + } + $cr->stroke; + + #helplines + $cr->set_source_rgba(@helplinecol1); + foreach my $j (1..3) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 
1 : 0)); + } + $cr->stroke; + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis +# $extents = $cr->text_extents(1); +# $cr->move_to($left+$offset+int($size/2+1)-$extents->{width}, $top+$offset+$height+$fontheight+2); +# $cr->show_text(1); + foreach my $i (1..9) { + $extents = $cr->text_extents($i*10*$bin); + $cr->move_to($left+$offset+int($size/2+1)+$size*10*$i-($zero ? 0 : $size)-$extents->{width}/2-1-($i == 1 ? 1 : 0), $top+$offset+$height+$fontheight+2); + $cr->show_text($i*10*$bin); + } + #y-axis + foreach my $j (0..4) { + $extents = $cr->text_extents(&addCommas($ymax*$j/4)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2+$height*(4-$j)/4); + $cr->show_text(&addCommas($ymax*$j/4)); + } + + $cr->save; + + #axis labels (the closing parenthesis is always emitted, even when $add is empty) + $cr->set_source_rgba(0, 0, 0, 1); + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + $extents = $cr->text_extents($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? $add : '').')' : '')); + $cr->move_to($left+$offset+($xmax+$zero)*$size/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? 
$add : '').')' : '')); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2),$offset); + $cr->show_text($ylab); + + $cr->restore; + + #draw boxes + my $factor = $height/$ymax; + $cr->set_antialias('none'); + foreach my $v (@$matrix) { + #whiskers + $cr->set_source_rgba(@whiscol); + if($v->[1] != $v->[2]) { + $cr->move_to($left+$offset+$size*$v->[0]+1, $top+$offset+$height-$v->[1]*$factor-1); + $cr->line_to($left+$offset+$size*$v->[0]+$size-2, $top+$offset+$height-$v->[1]*$factor-1); + $cr->stroke; + } + if($v->[4] != $v->[5]) { + $cr->move_to($left+$offset+$size*$v->[0]+1, $top+$offset+$height-$v->[5]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+$size-2, $top+$offset+$height-$v->[5]*$factor); + $cr->stroke; + } + $cr->save; + $cr->set_dash(1,4,3); + if($v->[1] != $v->[2]) { + $cr->move_to($left+$offset+$size*$v->[0]+int($size/2)-1, $top+$offset+$height-$v->[2]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+int($size/2)-1, $top+$offset+$height-$v->[1]*$factor); + $cr->stroke; + } + if($v->[4] != $v->[5]) { + $cr->move_to($left+$offset+$size*$v->[0]+int($size/2)-1, $top+$offset+$height-$v->[5]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+int($size/2)-1, $top+$offset+$height-$v->[4]*$factor-1); + $cr->stroke; + } + $cr->restore; + #box + if(($v->[2] != $v->[3]) || ($v->[4] != $v->[3])) { + $cr->set_source_rgba(@whiscol); + $cr->rectangle($left+$offset+$size*$v->[0], $top+$offset+$height-$v->[2]*$factor, $size-1, -($v->[4]-$v->[2])*$factor); + $cr->fill; + $cr->stroke; + $cr->set_source_rgba(@boxcol); + $cr->rectangle($left+$offset+$size*$v->[0], $top+$offset+$height-$v->[2]*$factor, $size-2, -($v->[4]-$v->[2])*$factor); + $cr->stroke; + } else { + $cr->set_source_rgba(@boxcol); + $cr->move_to($left+$offset+$size*$v->[0], $top+$offset+$height-$v->[3]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+$size-1, $top+$offset+$height-$v->[3]*$factor); + $cr->stroke; + } + 
#median + $cr->set_source_rgba(@medcol); + $cr->move_to($left+$offset+$size*$v->[0]+1, $top+$offset+$height-$v->[3]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+$size-2, $top+$offset+$height-$v->[3]*$factor); + $cr->stroke; + } + + #write image + $cr->show_page; + return $surface; +} + +sub convertToBarValues { + my ($data,$niceval,$start,$max) = @_; + my ($xmax,$ymax,@matrix,$tmp); + $xmax = $ymax = 0; + + #get xmax value + if($max) { + $xmax = $max; + } else { + foreach my $q (keys %$data) { + $xmax = &max($xmax,$q); + } + } + if($niceval) { + $xmax = sprintf("%d",($xmax/$niceval)+1)*$niceval if($xmax % $niceval); + } + + #get matrix values + foreach my $q ($start..$xmax) { + $tmp = (exists $data->{$q} ? $data->{$q} : 0); + $ymax = &max($ymax,$tmp); + push(@matrix,$tmp); + } + + $ymax = sprintf("%d",($ymax/4)+1)*4 if($ymax % 4); + + return (\@matrix,$xmax,$ymax); +} + +sub createBarPlot { + my ($matrix,$xmax,$ymax,$title,$xlab,$ylab,$file,$zero) = @_; + + my @col0 = (178/255, 178/255, 255/255); #b2b2ff + my @col1 = (255/255, 178/255, 178/255); #ffb2b2 + my @col3 = (127/255, 127/255, 255/255); #7f7fff + my @col4 = (255/255, 127/255, 127/255); #ff7f7f + my @linecol = (0, 0, 0, 0.4); + my @linecol0 = (@col3, 1); + my @linecol1 = (@col4, 1); + my @barcol0 = (@col3, 1); + my @barcol1 = (@col4, 1); + my @helplinecol1 = (1,1,1, 0.9); + my @helplinecol2 = (1,1,1, 0.5); + + #create new image + my $size = ($xmax <= 50 ? 10 : ($xmax <= 100 ? 
6 : 3)); + my $offset = 20; + my $left = 25; + my $bottom = 15; + my $top = 0; + my $height = 200; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+2*200+20); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); +# $cr->set_font_size (30); + + $cr->save; + + #set up work space + my ($dx, $dy); + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, ($xmax+$zero)*$size-1, $height); + $cr->set_source_rgba(0.95, 0.95, 0.95, 1); + $cr->fill; + + #draw ticks + #x-axis + $cr->set_source_rgba(0, 0, 0, 0.8); + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%5) == 0 && $i > 1 && $i < $xmax) { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+3); + } else { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+1); + } + } + $cr->stroke; + + #y-axis + $cr->move_to($left+$offset, $top+$offset); + $cr->line_to($left+$offset-3, $top+$offset); + $cr->move_to($left+$offset, $top+$offset+$height-1); + $cr->line_to($left+$offset-3, $top+$offset+$height-1); + $cr->stroke; + + #helplines + $cr->set_source_rgba(@helplinecol1); + foreach my $j (1..3) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 
1 : 0)); + } + $cr->stroke; + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%5) == 0 && $i > 1 && $i < $xmax) { + $extents = $cr->text_extents($i); + $cr->move_to($left+$offset+int($size/2+1)+$size*$i-($zero ? 0 : $size)-$extents->{width}/2-1-($i == 1 ? 1 : 0), $top+$offset+$height+$fontheight+2); + $cr->show_text($i); + } + } + #y-axis + $extents = $cr->text_extents(&addCommas($ymax)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2); + $cr->show_text(&addCommas($ymax)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2+$height); + $cr->show_text(0); + + $cr->save; + + #labels + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #axis labels + $cr->set_source_rgba(0, 0, 0, 1); + $extents = $cr->text_extents($xlab); + $cr->move_to($left+$offset+($xmax+$zero)*$size/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2),$offset); + $cr->show_text($ylab); + + $cr->restore; + + #draw boxes + $cr->set_antialias('none'); + foreach my $pos (0..$xmax-($zero ? 0 : 1)) { + next unless($matrix->[$pos+($zero ? 0 : 1)]); + my $tmp = $matrix->[$pos+($zero ? 0 : 1)] / $ymax; + #unique + if($tmp) { + $cr->set_source_rgba(@barcol0); + $cr->rectangle($left+$offset+($pos+($zero ? 
0 : 1))*$size, $top+$offset+$height, $size-1, -$tmp*$height); + $cr->fill; + } + } + + #write image + $cr->show_page; + return $surface; +} + +sub convertOdToStackBinMatrix { + my ($data,$stacks,$min,$max,$nonice) = @_; + + my ($num,$ymax,$xmax,$xmin,$step,%vals,%sums,$sum,@matrix,$bin,$tmpbin); + + #make nice xmax value + if(defined $max) { + $xmax = $max; + } else { + $xmax = (sort {$b <=> $a} keys %$data)[0]; + } + $bin = &getBinVal($xmax); + $xmax = $bin*100; + $xmin = (defined $min ? $min : 0); + + #get data to bin and find y axis max value + $ymax = 0; + foreach my $s (0..$stacks-1) { + $sums{$s} = 0; + } + $sum = 0; + $tmpbin = $bin; + foreach my $i ($xmin..$xmax) { + foreach my $s (0..$stacks-1) { + next unless(exists $data->{$i}->{$s}); + $sums{$s} += $data->{$i}->{$s}; + $sum += $data->{$i}->{$s}; + } + if(--$tmpbin <= 0) { + $tmpbin = $bin; + $ymax = &max($ymax,$sum); + $sum = 0; + foreach my $s (0..$stacks-1) { + push(@{$matrix[$s]},$sums{$s}); + $sums{$s} = 0; + } + } + } + + #make nice ymax value + unless($nonice) { + $ymax = sprintf("%d",($ymax/4)+1)*4 if($ymax % 4); +# $step = ($ymax <= 10 ? 10 : ($ymax < 40 ? 40 : ($ymax < 100 ? 100 : ($ymax < 1000 ? 100 : 100)))); +# $ymax = sprintf("%d",($ymax/$step)+1)*$step if($ymax % $step); + } + + return (\@matrix,$xmax,$ymax,$stacks); +} + +sub createStackBarPlot { + my ($matrix,$xmax,$ymax,$stacks,$title,$xlab,$ylab,$file,$zero,$add) = @_; + + my $bin = 1; + if($xmax > 100) { + $bin = $xmax / 100; + $xmax = 100; + } + + my @legend = ('Exact dupl.','5\' dupl.','3\' dupl.','Rev. compl. exact dupl.','Rev. compl. 
5\'/3\' dupl.'); + my @cols = ([69/255, 114/255, 167/255, 1], + [137/255, 165/255, 78/255, 1], + [170/255, 70/255, 67/255, 1], + [147/255, 169/255, 207/255, 1], + [51/255, 102/255, 102/255, 1]); + my @barcol = (127/255, 127/255, 255/255, 1); #7f7fff + my @meancol = (255/255, 127/255, 127/255, 1); #ff7f7f + my @stdcol = (178/255, 178/255, 255/255, 0.8); #b2b2ff + my @std1col = (0, 0, 0, 0.02); #black, 2% opacity + my @std2col = (0, 0, 0, 0.02); #black, 2% opacity + my @linecol = (0, 0, 0, 0.4); + my @helplinecol = (1, 1, 1, 0.9); + my @background = (0.95, 0.95, 0.95, 1); + my @tickcol = (0, 0, 0, 0.8); + my @labelcol = (0, 0, 0, 1); + + #create new image + my $size = 6; + my $offset = 20; + my $left = 40; + my $bottom = 15; + my $top = 20; + my $height = 200; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+($xmax+$zero)*$size,$bottom+$top+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+2*200+20); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); + + $cr->save; + + #set up work space + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, ($xmax+$zero)*$size-1, $height); + $cr->set_source_rgba(@background); + $cr->fill; + + #draw legend + $cr->set_font_size(10); +# $font_extents = $cr->font_extents; + my $x = $left+$offset+$size*100-5; + foreach my $i (reverse (0..scalar(@legend)-1)) { + $cr->set_antialias('default'); + $x -= $cr->text_extents($legend[$i])->{width}; + $cr->move_to($x,$top+5+9); + $cr->set_source_rgba(@tickcol); + $cr->show_text($legend[$i]); + $x -= 15; + $cr->set_antialias('none'); + $cr->set_source_rgba(@{$cols[$i]}); + $cr->rectangle($x, $top+5, 10, 10); + $cr->fill; + $x -= 15; + } + + #draw ticks + 
$cr->set_antialias('none'); + #x-axis + $cr->set_source_rgba(@tickcol); + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%5) == 0 && $i > 1 && $i < $xmax) { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+3); + } else { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+1); + } + } + $cr->stroke; + + #y-axis + $cr->move_to($left+$offset, $top+$offset); + $cr->line_to($left+$offset-3, $top+$offset); + $cr->move_to($left+$offset, $top+$offset+$height-1); + $cr->line_to($left+$offset-3, $top+$offset+$height-1); + $cr->stroke; + + #helplines + $cr->set_source_rgba(@helplinecol); + foreach my $j (1..3) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 1 : 0)); + } + $cr->stroke; + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(@tickcol); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%10) == 0 && $i > 1 && $i < $xmax) { + $extents = $cr->text_extents($i*$bin); + $cr->move_to($left+$offset+int($size/2+1)+$size*$i-($zero ? 0 : $size)-$extents->{width}/2-1-($i == 1 ? 
1 : 0), $top+$offset+$height+$fontheight+2); + $cr->show_text($i*$bin); + } + } + #y-axis + $extents = $cr->text_extents(&addCommas($ymax)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2); + $cr->show_text(&addCommas($ymax)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2+$height); + $cr->show_text(0); + + $cr->save; + + #labels + $cr->set_font_size(14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #axis labels (the closing parenthesis is always emitted, even when $add is empty) + $cr->set_source_rgba(@labelcol); + $extents = $cr->text_extents($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? $add : '').')' : '')); + $cr->move_to($left+$offset+($xmax+$zero)*$size/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? $add : '').')' : '')); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab.($bin>1 ? ' (per bin)' : '')); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2+($bin>1 ? 12 : 0)),$offset+10); + $cr->show_text($ylab.($bin>1 ? ' (per bin)' : '')); + + $cr->restore; + + #draw boxes + $cr->set_antialias('none'); + foreach my $pos (0..$xmax-($zero ? 0 : 1)) { + my $tmp = 0; + foreach my $s (0..$stacks-1) { + next unless($matrix->[$s]->[$pos]); + my $cur = $matrix->[$s]->[$pos] / $ymax; + $cr->set_source_rgba(@{$cols[$s]}); + if($cur) { + $cr->rectangle($left+$offset+$pos*$size, $top+$offset+$height-$tmp*$height, $size-1, -$cur*$height); + $cr->fill; + } + $tmp += $cur; + } + } + + #write image + $cr->show_page; + return $surface; +} + +sub header { + return ' + + + +PRINSEQ-'.$WHAT.' Report + + + +
'; +} + +sub footer { + return '
'; +} + +sub generateHtml { + my ($in,$out) = @_; + my ($file,$data,$surface,$html,$png); + $data = &readGdFile($in); + my $time = sprintf("%02d/%02d/%04d %02d:%02d:%02d",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); + + $html .= &header(); + $html .= '

PRINSEQ-'.$WHAT.' v'.$VERSION.' HTML Report   

[Generated: '.$time.']

'; + $html .= '
'; + + #input info + if(exists $data->{numseqs}) { + $html .= '
Input Information
'; + $html .= '
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + my $singletons1 = ($data->{numseqs}||0)-($data->{pairs}||0); + my $singletons2 = ($data->{numseqs2}||0)-($data->{pairs}||0); + $html .= ''; + } else { + $html .= ''; + } + $html .= '
Input file(s):'.($data->{filename1} ? &convertIntToString($data->{filename1}) : '-').($data->{filename2} ? ' and '.&convertIntToString($data->{filename2}) : '').'
Input format(s):'.($data->{format1} ? uc($data->{format1}) : '-').($data->{format2} ? ' and '.uc($data->{format2}) : '').'
# Sequences (file 1):'.&addCommas($data->{numseqs}||'-').'
Total bases (file 1):'.&addCommas($data->{numbases}||'-').'
# Sequences (file 2):'.&addCommas($data->{numseqs2}||'-').'
Total bases (file 2):'.&addCommas($data->{numbases2}||'-').'
# Pairs:'.&addCommas($data->{pairs}||'-').($data->{pairs} ? '  ('.sprintf("%.2f",(100*(2*$data->{pairs})/(($data->{numseqs}||0)+($data->{numseqs2}||0)))).'% of sequences)' : '').'
# Singletons (file 1):'.&addCommas($singletons1).($singletons1 ? '  ('.sprintf("%.2f",(100*$singletons1/$data->{numseqs})).'%)' : '').'
# Singletons (file 2):'.&addCommas($singletons2).($singletons2 ? '  ('.sprintf("%.2f",(100*$singletons2/$data->{numseqs2})).'%)' : '').'
# Sequences:'.&addCommas($data->{numseqs}||'-').'
Total bases:'.&addCommas($data->{numbases}||'-').'

'; + } + + #length plot + if(exists $data->{counts}->{length} && keys %{$data->{counts}->{length}}) { + $html .= '
Length Distribution
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= '
File 1
Mean sequence length: '.(exists $data->{stats}->{length}->{mean} ? sprintf("%.2f",$data->{stats}->{length}->{mean}) : '-').' ± '.(exists $data->{stats}->{length}->{std} ? sprintf("%.2f",$data->{stats}->{length}->{std}) : '-').' bp
Minimum length: '.(exists $data->{stats}->{length}->{min} ? &addCommas($data->{stats}->{length}->{min}) : '-').' bp
Maximum length:'.(exists $data->{stats}->{length}->{max} ? &addCommas($data->{stats}->{length}->{max}) : '-').' bp
Length range:'.(exists $data->{stats}->{length}->{range} ? &addCommas($data->{stats}->{length}->{range}) : '-').' bp
Mode length: '.(exists $data->{stats}->{length}->{mode} ? &addCommas($data->{stats}->{length}->{mode}) : '-').' bp with '.(exists $data->{stats}->{length}->{modeval} ? &addCommas($data->{stats}->{length}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{length},1),$data->{stats}->{length},'Length Distribution','Read Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + $html .= '

File 2
Mean sequence length: '.(exists $data->{stats2}->{length}->{mean} ? sprintf("%.2f",$data->{stats2}->{length}->{mean}) : '-').' ± '.(exists $data->{stats2}->{length}->{std} ? sprintf("%.2f",$data->{stats2}->{length}->{std}) : '-').' bp
Minimum length: '.(exists $data->{stats2}->{length}->{min} ? &addCommas($data->{stats2}->{length}->{min}) : '-').' bp
Maximum length:'.(exists $data->{stats2}->{length}->{max} ? &addCommas($data->{stats2}->{length}->{max}) : '-').' bp
Length range:'.(exists $data->{stats2}->{length}->{range} ? &addCommas($data->{stats2}->{length}->{range}) : '-').' bp
Mode length: '.(exists $data->{stats2}->{length}->{mode} ? &addCommas($data->{stats2}->{length}->{mode}) : '-').' bp with '.(exists $data->{stats2}->{length}->{modeval} ? &addCommas($data->{stats2}->{length}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{length},1),$data->{stats2}->{length},'Length Distribution','Read Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + } else { + $html .= '
Mean sequence length: '.(exists $data->{stats}->{length}->{mean} ? sprintf("%.2f",$data->{stats}->{length}->{mean}) : '-').' ± '.(exists $data->{stats}->{length}->{std} ? sprintf("%.2f",$data->{stats}->{length}->{std}) : '-').' bp
Minimum length: '.(exists $data->{stats}->{length}->{min} ? &addCommas($data->{stats}->{length}->{min}) : '-').' bp
Maximum length:'.(exists $data->{stats}->{length}->{max} ? &addCommas($data->{stats}->{length}->{max}) : '-').' bp
Length range:'.(exists $data->{stats}->{length}->{range} ? &addCommas($data->{stats}->{length}->{range}) : '-').' bp
Mode length: '.(exists $data->{stats}->{length}->{mode} ? &addCommas($data->{stats}->{length}->{mode}) : '-').' bp with '.(exists $data->{stats}->{length}->{modeval} ? &addCommas($data->{stats}->{length}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{length},1),$data->{stats}->{length},'Length Distribution','Read Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + } + $html .= '
'; + } + + #GC content + if(exists $data->{counts}->{gc} && keys %{$data->{counts}->{gc}}) { + $html .= '
GC Content Distribution
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= '
File 1
Mean GC content: '.(exists $data->{stats}->{gc}->{mean} ? sprintf("%.2f",$data->{stats}->{gc}->{mean}) : '-').' ± '.(exists $data->{stats}->{gc}->{std} ? sprintf("%.2f",$data->{stats}->{gc}->{std}) : '-').' %
Minimum GC content: '.(exists $data->{stats}->{gc}->{min} ? $data->{stats}->{gc}->{min} : '-').' %
Maximum GC content: '.(exists $data->{stats}->{gc}->{max} ? $data->{stats}->{gc}->{max} : '-').' %
GC content range: '.(exists $data->{stats}->{gc}->{range} ? $data->{stats}->{gc}->{range} : '-').' %
Mode GC content: '.(exists $data->{stats}->{gc}->{mode} ? $data->{stats}->{gc}->{mode} : '-').' % with '.(exists $data->{stats}->{gc}->{modeval} ? &addCommas($data->{stats}->{gc}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{gc},0),$data->{stats}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + $html .= '

File 2
Mean GC content: '.(exists $data->{stats2}->{gc}->{mean} ? sprintf("%.2f",$data->{stats2}->{gc}->{mean}) : '-').' ± '.(exists $data->{stats2}->{gc}->{std} ? sprintf("%.2f",$data->{stats2}->{gc}->{std}) : '-').' %
Minimum GC content: '.(exists $data->{stats2}->{gc}->{min} ? $data->{stats2}->{gc}->{min} : '-').' %
Maximum GC content: '.(exists $data->{stats2}->{gc}->{max} ? $data->{stats2}->{gc}->{max} : '-').' %
GC content range: '.(exists $data->{stats2}->{gc}->{range} ? $data->{stats2}->{gc}->{range} : '-').' %
Mode GC content: '.(exists $data->{stats2}->{gc}->{mode} ? $data->{stats2}->{gc}->{mode} : '-').' % with '.(exists $data->{stats2}->{gc}->{modeval} ? &addCommas($data->{stats2}->{gc}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{gc},0),$data->{stats2}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + } else { + $html .= '
Mean GC content: '.(exists $data->{stats}->{gc}->{mean} ? sprintf("%.2f",$data->{stats}->{gc}->{mean}) : '-').' ± '.(exists $data->{stats}->{gc}->{std} ? sprintf("%.2f",$data->{stats}->{gc}->{std}) : '-').' %
Minimum GC content: '.(exists $data->{stats}->{gc}->{min} ? $data->{stats}->{gc}->{min} : '-').' %
Maximum GC content: '.(exists $data->{stats}->{gc}->{max} ? $data->{stats}->{gc}->{max} : '-').' %
GC content range: '.(exists $data->{stats}->{gc}->{range} ? $data->{stats}->{gc}->{range} : '-').' %
Mode GC content: '.(exists $data->{stats}->{gc}->{mode} ? $data->{stats}->{gc}->{mode} : '-').' % with '.(exists $data->{stats}->{gc}->{modeval} ? &addCommas($data->{stats}->{gc}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{gc},0),$data->{stats}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + } + $html .= '
'; + } + + #Base quality + if(exists $data->{quals} || exists $data->{qualsmean} || exists $data->{qualsbin}) { + $html .= '
Base Quality Distribution
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= 'File 1
'; + } + } + if(exists $data->{quals} && keys %{$data->{quals}}) { + $surface = &createBoxPlot(&convertToBoxValues($data->{quals},4),'Base Quality Distribution','Read position in %','Quality score',''); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + } + if(exists $data->{qualsbin} && keys %{$data->{qualsbin}}) { + $surface = &createBoxPlot(&convertToBoxValues($data->{qualsbin},4),'Base Quality Distribution','Read position in bp','Quality score','',0,'bp',$data->{binval}); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{quals}); + $html .= &insert_image($png); + } + if(exists $data->{qualsmean} && keys %{$data->{qualsmean}}) { + $surface = &createBarPlot(&convertToBarValues($data->{qualsmean},5,1),'Sequence Quality Distribution','Mean of quality scores per sequence','Number of sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{qualsbin}); + $html .= &insert_image($png); + } + if(exists $data->{pairedend} && $data->{pairedend}) { + if(exists $data->{quals} || exists $data->{qualsmean} || exists $data->{qualsbin}) { + $html .= '


File 2
'; + } + if(exists $data->{quals2} && keys %{$data->{quals2}}) { + $surface = &createBoxPlot(&convertToBoxValues($data->{quals2},4),'Base Quality Distribution','Read position in %','Quality score',''); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + } + if(exists $data->{qualsbin2} && keys %{$data->{qualsbin2}}) { + $surface = &createBoxPlot(&convertToBoxValues($data->{qualsbin2},4),'Base Quality Distribution','Read position in bp','Quality score','',0,'bp',$data->{binval}); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{quals2}); + $html .= &insert_image($png); + } + if(exists $data->{qualsmean2} && keys %{$data->{qualsmean2}}) { + $surface = &createBarPlot(&convertToBarValues($data->{qualsmean2},5,1),'Sequence Quality Distribution','Mean of quality scores per sequence','Number of sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{qualsbin2}); + $html .= &insert_image($png); + } + } + if(exists $data->{quals} || exists $data->{qualsmean} || exists $data->{qualsbin}) { + $html .= '

'; + } + + #Ns + if((exists $data->{counts}->{ns} && keys %{$data->{counts}->{ns}}) || (exists $data->{counts2} && exists $data->{counts2}->{ns} && keys %{$data->{counts2}->{ns}})) { + $html .= '
Occurrence of N
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= 'File 1
'; + } + } + if(exists $data->{counts}->{ns} && keys %{$data->{counts}->{ns}}) { + my $nscount = 0; + foreach my $n (values %{$data->{counts}->{ns}}) { + $nscount += $n; + } + $html .= '
Sequences with N: '.($nscount ? &addCommas($nscount).'  ('.sprintf("%.2f",100/$data->{numseqs}*$nscount).' %)' : 0).'
Max percentage of Ns per sequence: '.(exists $data->{stats}->{ns}->{max} ? $data->{stats}->{ns}->{max} : 0).' %
'; + if($nscount) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{ns},1),undef,'Percentage of N\'s (> 0%)','Percentage of N\'s per Read (1-100%)','# Sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + } + if(exists $data->{pairedend} && $data->{pairedend} && exists $data->{counts2}->{ns} && keys %{$data->{counts2}->{ns}}) { + $html .= '


File 2
'; + my $nscount = 0; + foreach my $n (values %{$data->{counts2}->{ns}}) { + $nscount += $n; + } + $html .= '
Sequences with N: '.($nscount ? &addCommas($nscount).'  ('.sprintf("%.2f",100/$data->{numseqs2}*$nscount).' %)' : 0).'
Max percentage of Ns per sequence: '.(exists $data->{stats2}->{ns}->{max} ? $data->{stats2}->{ns}->{max} : 0).' %
'; + if($nscount) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{ns},1),undef,'Percentage of N\'s (> 0%)','Percentage of N\'s per Read (1-100%)','# Sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + } + if((exists $data->{counts}->{ns} && keys %{$data->{counts}->{ns}}) || (exists $data->{counts2} && exists $data->{counts2}->{ns} && keys %{$data->{counts2}->{ns}})) { + $html .= '

'; + } + + #tails + if(exists $data->{tail} || exists $data->{tail2}) { + $html .= '
Poly-A/T Tails
'; + } + if(exists $data->{tail}) { + my $tail5count = 0; + foreach my $n (values %{$data->{counts}->{tail5}}) { + $tail5count += $n; + } + my $tail3count = 0; + foreach my $n (values %{$data->{counts}->{tail3}}) { + $tail3count += $n; + } + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= 'File 1
'; + } + $html .= '
5\'-end 3\'-end
Sequences with tail:'.($tail5count ? &addCommas($tail5count).'  ('.sprintf("%.2f",100/$data->{numseqs}*$tail5count).' %)' : 0).' '.($tail3count ? &addCommas($tail3count).'  ('.sprintf("%.2f",100/$data->{numseqs}*$tail3count).' %)' : 0).'
Maximum tail length: '.(exists $data->{stats}->{tail5}->{max} ? $data->{stats}->{tail5}->{max} : 0).' '.(exists $data->{stats}->{tail3}->{max} ? $data->{stats}->{tail3}->{max} : 0).'
'; + if($tail5count) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{tail5},1),undef,'Poly-A/T Tail Distribution (> 4bp)','5\' Tail Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + if($tail3count) { + $html .= '
'; + } + } + if($tail3count) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{tail3},1),undef,'Poly-A/T Tail Distribution (> 4bp)','3\' Tail Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + } + if(exists $data->{pairedend} && $data->{pairedend} && exists $data->{tail2}) { + my $tail5count = 0; + foreach my $n (values %{$data->{counts2}->{tail5}}) { + $tail5count += $n; + } + my $tail3count = 0; + foreach my $n (values %{$data->{counts2}->{tail3}}) { + $tail3count += $n; + } + $html .= '


File 2
'; + $html .= '
5\'-end 3\'-end
Sequences with tail:'.($tail5count ? &addCommas($tail5count).'  ('.sprintf("%.2f",100/$data->{numseqs2}*$tail5count).' %)' : 0).' '.($tail3count ? &addCommas($tail3count).'  ('.sprintf("%.2f",100/$data->{numseqs2}*$tail3count).' %)' : 0).'
Maximum tail length: '.(exists $data->{stats2}->{tail5}->{max} ? $data->{stats2}->{tail5}->{max} : 0).' '.(exists $data->{stats2}->{tail3}->{max} ? $data->{stats2}->{tail3}->{max} : 0).'
'; + if($tail5count) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{tail5},1),undef,'Poly-A/T Tail Distribution (> 4bp)','5\' Tail Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + if($tail3count) { + $html .= '
'; + } + } + if($tail3count) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{tail3},1),undef,'Poly-A/T Tail Distribution (> 4bp)','3\' Tail Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + } + if(exists $data->{tail} || exists $data->{tail2}) { + $html .= '

'; + } + + + #tag sequence check + if(exists $data->{freqs} || exists $data->{freqs2}) { + $html .= '
Tag Sequence Check
'; + } + if(exists $data->{freqs}) { + my $tagmidseq; + if(exists $data->{tagmidseq}) { + $tagmidseq = $data->{tagmidseq}; + $tagmidseq =~ s/\,/\
/g; + } + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= 'File 1
'; + } + $html .= '
5\'-end3\'-end
Probability of tag sequence:'.(exists $data->{tagprob}->{5} ? $data->{tagprob}->{5}.' %' : '-').''.(exists $data->{tagprob}->{3} ? $data->{tagprob}->{3}.' %' : '-').'
GSMIDs or RLMIDs:'.(exists $data->{tagmidnum} ? ($data->{tagmidnum} == 0 ? 'none' : ($tagmidseq ? $tagmidseq : $data->{tagmidnum})) : '-').' 

'; + $html .= ''; + foreach my $pos (sort {$a <=> $b} keys %{$data->{freqs}->{5}}) { + $html .= ''; + } + $html .= ''; + foreach my $pos (sort {$a <=> $b} keys %{$data->{freqs}->{3}}) { + $html .= ''; + } + $html .= ''; + $html .= ''; + foreach my $num (1,0,0,0,5,0,0,0,0,10,0,0,0,0,15,0,0,0,0,20,0,20,0,0,0,0,15,0,0,0,0,10,0,0,0,0,5,0,0,0,1) { + $html .= ''; + } + $html .= ''; + $html .= '
'.&insert_image($FREQCHART_L,undef,undef,1).''; + foreach my $base (qw(A C G T N)) { + if($data->{freqs}->{5}->{$pos}->{$base}) { + $html .= &insert_image($BASE64_BASES->{$base},$data->{freqs}->{5}->{$pos}->{$base},14,1).'
'; + #''.$base.'
'; + } + } + $html .= &insert_image($MMCHART_B2,6,16,1).'
 ... '; + foreach my $base (qw(A C G T N)) { + if($data->{freqs}->{3}->{$pos}->{$base}) { + $html .= &insert_image($BASE64_BASES->{$base},$data->{freqs}->{3}->{$pos}->{$base},14,1).'
'; + } + } + $html .= &insert_image($MMCHART_B2,6,16,1).'
 '.($num ? $num : '').' 
 Position from Sequence Ends
'; + } + if(exists $data->{pairedend} && $data->{pairedend} && exists $data->{freqs2}) { + $html .= '


File 2
'; + $html .= '
5\'-end3\'-end
Probability of tag sequence:'.(exists $data->{tagprob2}->{5} ? $data->{tagprob2}->{5}.' %' : '-').''.(exists $data->{tagprob2}->{3} ? $data->{tagprob2}->{3}.' %' : '-').'

'; + $html .= ''; + foreach my $pos (sort {$a <=> $b} keys %{$data->{freqs2}->{5}}) { + $html .= ''; + } + $html .= ''; + foreach my $pos (sort {$a <=> $b} keys %{$data->{freqs2}->{3}}) { + $html .= ''; + } + $html .= ''; + $html .= ''; + foreach my $num (1,0,0,0,5,0,0,0,0,10,0,0,0,0,15,0,0,0,0,20,0,20,0,0,0,0,15,0,0,0,0,10,0,0,0,0,5,0,0,0,1) { + $html .= ''; + } + $html .= ''; + $html .= '
'.&insert_image($FREQCHART_L,undef,undef,1).''; + foreach my $base (qw(A C G T N)) { + if($data->{freqs2}->{5}->{$pos}->{$base}) { + $html .= &insert_image($BASE64_BASES->{$base},$data->{freqs2}->{5}->{$pos}->{$base},14,1).'
'; + #''.$base.'
'; + } + } + $html .= &insert_image($MMCHART_B2,6,16,1).'
 ... '; + foreach my $base (qw(A C G T N)) { + if($data->{freqs2}->{3}->{$pos}->{$base}) { + $html .= &insert_image($BASE64_BASES->{$base},$data->{freqs2}->{3}->{$pos}->{$base},14,1).'
'; + } + } + $html .= &insert_image($MMCHART_B2,6,16,1).'
 '.($num ? $num : '').' 
 Position from Sequence Ends
'; + } + if(exists $data->{freqs} || exists $data->{freqs2}) { + $html .= '

'; + } + + #Sequence duplicates + if(exists $data->{dubslength} || exists $data->{dubscounts}) { + $html .= '
Sequence Duplication
'; + } + my %dubs; + if(exists $data->{dubscounts} && keys %{$data->{dubscounts}}) { + my $exactonly = $data->{exactonly}||0; + foreach my $n (keys %{$data->{dubscounts}}) { + foreach my $s (keys %{$data->{dubscounts}->{$n}}) { + $dubs{$s}->{count} += $data->{dubscounts}->{$n}->{$s} * $n; + $dubs{$s}->{max} = $n unless(exists $dubs{$s}->{max} && $dubs{$s}->{max} > $n); + $dubs{all} += $data->{dubscounts}->{$n}->{$s} * $n; + } + } + $html .= '
'; + unless($exactonly) { + $html .= ''; + } + $html .= '
# Sequences Max duplicates
Exact duplicates:'.(exists $dubs{0}->{count} ? &addCommas($dubs{0}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{0}->{count}).' %)' : 0).''.($dubs{0}->{max}||0).'
Exact duplicates with reverse complements:'.(exists $dubs{3}->{count} ? &addCommas($dubs{3}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{3}->{count}).' %)' : 0).' '.($dubs{3}->{max}||0).'
5\' duplicates'.(exists $dubs{1}->{count} ? &addCommas($dubs{1}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{1}->{count}).' %)' : 0).' '.($dubs{1}->{max}||0).'
3\' duplicates'.(exists $dubs{2}->{count} ? &addCommas($dubs{2}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{2}->{count}).' %)' : 0).' '.($dubs{2}->{max}||0).'
5\'/3\' duplicates with reverse complements'.(exists $dubs{4}->{count} ? &addCommas($dubs{4}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{4}->{count}).' %)' : 0).' '.($dubs{4}->{max}||0).'
Total:'.(exists $dubs{all} ? &addCommas($dubs{all}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{all}).' %)' : 0).'-
'; + } + if(exists $dubs{all} && $dubs{all}) { + if(exists $data->{dubslength} && keys %{$data->{dubslength}}) { + $surface = &createStackBarPlot(&convertOdToStackBinMatrix($data->{dubslength},5,1),'Sequence duplication level','Read Length in bp','Number of duplicates','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + if(exists $data->{dubscounts} && keys %{$data->{dubscounts}}) { + $surface = &createStackBarPlot(&convertOdToStackBinMatrix($data->{dubscounts},5,1,100),'Sequence duplication level','Number of duplicates','Number of sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{dubslength}); + $html .= &insert_image($png); + my %dubsmax; + my $count = 1; + foreach my $n (sort {$b <=> $a} keys %{$data->{dubscounts}}) { + foreach my $s (keys %{$data->{dubscounts}->{$n}}) { + foreach my $i (1..$data->{dubscounts}->{$n}->{$s}) { + $dubsmax{$count++}->{$s} = $n; + last unless($count <= 100); + } + last unless($count <= 100); + } + last unless($count <= 100); + } + $surface = &createStackBarPlot(&convertOdToStackBinMatrix(\%dubsmax,5,1,100),'Sequence duplication level','Sequence','Number of duplicates','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{dubslength}); + $html .= &insert_image($png); + } + } + if(exists $data->{dubslength} || exists $data->{dubscounts}) { + $html .= '

'; + } + + #Sequence complexity + if(exists $data->{compldust} || exists $data->{complentropy}) { + $html .= '
Sequence Complexity
'; + if(exists $data->{complvals}) { + my $complseq; + foreach my $d (keys %{$data->{complvals}}) { + foreach my $m ('minseq','maxseq') { + $complseq = $data->{complvals}->{$d}->{$m}; + $complseq = substr($complseq,0,797).'...' if(length($complseq) > 800); + $complseq =~ s/(.{60})/$1\
/g; + $data->{complvals}->{$d}->{$m} = $complseq; + } + } + } + $html .= '
ValueSequence
Minimum DUST score:'.(exists $data->{complvals}->{dust}->{minval} ? $data->{complvals}->{dust}->{minval} : '-').''.(exists $data->{complvals}->{dust}->{minseq} ? $data->{complvals}->{dust}->{minseq} : '').'
Maximum DUST score:'.(exists $data->{complvals}->{dust}->{maxval} ? $data->{complvals}->{dust}->{maxval} : '').''.(exists $data->{complvals}->{dust}->{maxseq} ? $data->{complvals}->{dust}->{maxseq} : '').'
Minimum Entropy value:'.(exists $data->{complvals}->{entropy}->{minval} ? $data->{complvals}->{entropy}->{minval} : '').''.(exists $data->{complvals}->{entropy}->{minseq} ? $data->{complvals}->{entropy}->{minseq} : '').'
Maximum Entropy value:'.(exists $data->{complvals}->{entropy}->{maxval} ? $data->{complvals}->{entropy}->{maxval} : '').''.(exists $data->{complvals}->{entropy}->{maxseq} ? $data->{complvals}->{entropy}->{maxseq} : '').'

'; + } + if(exists $data->{compldust}) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{compldust},0),undef,'Sequence complexity distribution','Mean sequence complexity (DUST scores)','Number of sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + } + if(exists $data->{complentropy}) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{complentropy},0),undef,'Sequence complexity distribution','Mean sequence complexity (Entropy values)','Number of sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{compldust}); + $html .= &insert_image($png); + } + if(exists $data->{compldust} || exists $data->{complentropy}) { + $html .= '

'; + } + + #Dinucleotide odds ratio PCA - microbial/viral + if(exists $data->{dinucodds} && keys %{$data->{dinucodds}}) { + $html .= '
Dinucleotide Odds Ratios
'; + $html .= '
'; + foreach my $d (map {join("/",(m/../g ))} sort keys %{$data->{dinucodds}}) { + $html .= ''; + } + $html .= ''; + foreach my $d (map {sprintf("%.4f",$data->{dinucodds}->{$_})} sort keys %{$data->{dinucodds}}) { + $html .= ''; + } + $html .= '
 '.$d.'
Odds ratio'.$d.'

'; + my @new = map {$data->{dinucodds}->{$_}} sort keys %{$data->{dinucodds}}; + $surface = &createOddsRatioPlot($data->{dinucodds},'Odds ratios','Dinucleotide','Odds ratio',''); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + #$surface = &createPCAPlot(&convertToPCAValues(\@new,'m'),'PCA','1st Principal Component Score','2nd Principal Component Score',''); + #$png = ''; + #$surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + #$html .= '

'; + #$html .= &insert_image($png); + #$surface = &createPCAPlot(&convertToPCAValues(\@new,'v'),'PCA','1st Principal Component Score','2nd Principal Component Score',''); + #$png = ''; + #$surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + #$html .= '

'; + #$html .= &insert_image($png); + $html .= '
'; + } + + $html .= '
'; + $html .= &footer(); + + #write html to file + $file = &getFileName('.html'); + open(FH, ">$file") or &printError("Can't open file ".$file.": $!"); + print FH $html; + close(FH); + &printLog("Done with HTML data"); +} + +sub insert_image { + my ($data, $height, $width, $noencode) = @_; + my $content .= ''."\n"; + return $content; +} + +sub inline_image { + return "data:image/png;base64,".MIME::Base64::encode_base64($_[0]); +} + +sub convertIntToString { + my $int = shift; + $int =~ s/(.{2})/chr(hex($1))/eg; + return $int; +} diff -r 000000000000 -r 9790cfb46d03 prinseq-graphs.pl --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/prinseq-graphs.pl Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,2665 @@ +#!/usr/bin/perl + +#=============================================================================== +# Author: Robert SCHMIEDER, Computational Science Research Center @ SDSU, CA +# +# File: prinseq-graphs +# Date: 2012-12-22 +# Version: 0.6 graphs +# +# Usage: +# prinseq-graphs [options] +# +# Try 'prinseq-graphs -h' for more information. +# +# Purpose: PRINSEQ will help you to preprocess your genomic or metagenomic +# sequence data in FASTA or FASTQ format. The graphs version allows +# users of the lite version to generate graphs similar to the web +# version. 
+# +# Bugs: Please use http://sourceforge.net/tracker/?group_id=315449 +# +#=============================================================================== + +use strict; +use warnings; + +use Getopt::Long; +use Pod::Usage; +use File::Temp qw(tempfile); #for output files +use Fcntl qw(:flock SEEK_END); #for log file +use Cwd; +use JSON; +use Cairo; +use Statistics::PCA; +use MIME::Base64; +use File::Basename; +use Data::Dumper; ### + +$| = 1; # Do not buffer output + +my $PI = 4 * atan2(1, 1); +my $LOG62 = log(62); +my $DINUCODDS_VIR = [ + [qw(1.086940308 0.98976932 1.034167044 0.880024041 1.070421277 0.990687084 0.890945575 1.069957074 0.92465631 0.803973303)], + [qw(1.101064857 0.986812783 1.038299155 0.896162618 1.081652847 0.976365237 0.867445186 1.06727283 0.94688543 0.768007295)], + [qw(1.071548411 0.912204166 1.196914981 0.80628184 1.294201511 1.148517794 0.269295791 1.033948026 0.895951033 0.623192149)], + [qw(1.090253719 0.907428629 1.203991784 0.786359294 1.281499107 1.145421568 0.235974709 1.033437274 0.899580091 0.631699771)], + [qw(1.075864745 1.003413074 1.01872902 0.897841689 0.980373171 1.05854979 0.934262259 1.052477953 0.88145851 0.889239724)], + [qw(1.101890467 1.030028291 1.019912674 0.84191395 1.0015174 1.069546264 0.900151602 0.996269395 0.889195343 0.904039022)], + [qw(1.152417359 0.855028574 0.91164793 1.017415486 1.114163672 1.128353311 0.846355573 0.916745489 1.206820475 0.811014651)], + [qw(1.142454218 0.8635465 0.923406967 1.026242747 1.134445058 1.131747833 0.79793368 0.920767641 1.179468556 0.799770057)], + [qw(1.124462747 0.873556143 0.945627041 1.013755408 1.159866153 1.096259526 0.757315047 0.972924919 1.105562567 0.772731886)], + [qw(1.143826972 0.866968779 0.995740249 0.945859278 1.109590621 1.089305083 0.76048874 0.971561388 1.157101408 0.792923027)], + [qw(1.131900141 0.82776996 0.996204924 0.999433455 1.024692372 1.071176333 0.921026216 1.088936699 1.054010776 0.773498892)], + [qw(1.042180476 0.930180412 1.019242897 0.98909997 
1.006666828 1.046708539 0.959492164 1.011183418 1.055168776 0.937433818)], + [qw(1.086515695 0.985345815 0.930914307 0.969581792 1.043010232 1.087463712 0.939482285 0.990551965 0.954752469 0.893972874)], + [qw(1.096657826 0.950117614 0.936195529 0.965619788 1.114975275 1.077011195 0.843153131 0.989128406 1.043790912 0.840634731)], + [qw(1.158030995 0.935307365 0.874812261 1.056236525 1.117171274 0.937484692 1.057442372 0.970079538 1.174848738 0.725071711)], + [qw(1.15591506 0.93000227 0.883538923 1.0567652 1.095730954 0.944489906 1.074229471 0.983993745 1.156051409 0.726688465)], + [qw(1.205726473 0.924439339 1.049457756 0.805718412 0.975472778 1.07581991 0.726992211 1.075025787 0.8704929 0.726672843)], + [qw(1.188544681 0.95239611 1.049066985 0.790031334 1.038632598 1.056749787 0.665197397 1.057566244 0.862429061 0.708982398)], + [qw(1.063631482 0.925593715 1.014869316 0.944904401 1.119690731 1.325971834 0.273781451 0.943347677 1.06438014 0.920825904)], + [qw(1.077560287 0.911888545 1.044147857 0.927758054 1.058535939 1.296838544 0.421514996 0.945722451 1.128317986 0.926419928)], + [qw(1.163753415 0.989905668 0.893599328 0.955641844 1.176047687 0.941559156 0.950641089 0.959741692 1.100815282 0.72491925)], + [qw(1.139253929 0.946297517 0.922096125 1.024801537 1.205206793 0.968818717 0.915801342 0.971626058 1.107569276 0.627623404)] + ]; +my $DINUCODDS_MIC = [ + [qw(1.13127323 0.853587195 0.911041047 1.104520778 1.065586428 1.021434164 0.999734139 1.063684014 1.078035184 0.733596552)], + [qw(1.173267344 0.840539337 0.919534602 1.068050141 1.062394214 1.051999071 0.96770576 1.035511729 1.095600433 0.72328141)], + [qw(1.172939786 0.84567902 0.911836259 1.106288994 1.05351787 1.026143368 1.002308358 1.066319771 1.094918797 0.710733535)], + [qw(1.073527689 0.850290918 0.978455025 1.080882178 1.111174765 1.010754115 0.895668707 1.072980666 1.079304608 0.754057386)], + [qw(1.08807747 0.837444678 0.95824965 1.097310298 1.118897971 1.030863881 0.886827263 1.072349394 
1.07406322 0.733440096)], + [qw(1.071685485 0.861055813 0.966566865 1.090268118 1.112945761 1.012538936 0.909535491 1.063745603 1.071156598 0.755770377)], + [qw(1.142698587 0.867936867 1.000612099 0.977934257 1.111801746 1.018318601 0.788556794 0.987763594 1.184649653 0.784776176)], + [qw(1.134560074 0.876651844 0.998190253 0.995723123 1.128448077 1.014172324 0.781776188 0.971020602 1.182411449 0.786449476)], + [qw(1.180029632 0.787899325 1.01316945 0.932268406 1.077837263 1.211699678 0.612128817 1.033036699 1.157314398 0.74940288)], + [qw(1.160925546 0.788308899 1.003702496 0.965371236 1.076051693 1.188304271 0.641536444 1.070331188 1.124067192 0.740126813)], + [qw(1.173873006 0.790118011 1.014718833 0.937979878 1.07453725 1.207167373 0.622279064 1.046150047 1.145627707 0.742212886)], + [qw(1.128383111 0.870541389 0.987269741 0.98353238 1.115643879 1.040107028 0.774505865 1.010896432 1.164757274 0.775254395)], + [qw(1.15297511 0.853883985 0.956393231 1.000027661 1.139915472 1.01355294 0.838843622 1.015553125 1.216219741 0.70447264)], + [qw(1.148264236 0.852123859 0.974568293 0.985455546 1.13192373 1.015879393 0.828987111 1.016820786 1.216647853 0.71634006)], + [qw(1.12933788 0.831777975 1.005434367 0.991081409 1.126146895 1.07421504 0.69343913 1.054032466 1.14809591 0.728541157)], + [qw(1.124157235 0.828112691 1.022348424 0.983822386 1.143028487 1.081830005 0.672594435 1.05685982 1.149537403 0.684432106)], + [qw(1.128029586 0.841853305 1.00983936 0.967179139 1.122524003 1.094555807 0.659238308 1.061578854 1.1243601 0.740148171)], + [qw(1.093521636 0.855071052 0.929160818 1.203773691 1.178257185 0.881341255 1.078305505 1.051988532 1.169143967 0.555057308)], + [qw(1.073737278 0.877396537 0.968017446 1.124155374 1.166244435 0.909044208 0.999147578 1.071098934 1.120156138 0.607444953)], + [qw(1.092150184 0.863407008 0.927040387 1.185387013 1.171670826 0.882276859 1.083058605 1.048379554 1.168635365 0.580337997)] + ]; +my $DATA_VIR = [ + [2,1,'Human (fecal)',[127/255, 
127/255, 255/255,1]], + [3,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [42,2,'Human (nasal)',[127/255, 127/255, 255/255,1]], + [43,2,'Human (nasal)',[127/255, 127/255, 255/255,1]], + [45,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [49,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [52,3,'Human (sputum)',[127/255, 127/255, 255/255,1]], + [54,3,'Human (sputum)',[127/255, 127/255, 255/255,1]], + [55,4,'Human (sputum, CF)',[127/255, 127/255, 255/255,1]], + [57,4,'Human (sputum, CF)',[127/255, 127/255, 255/255,1]], + [88,5,'Freshwater (Hot spring)',[127/255, 127/255, 255/255,1]], + [89,5,'Freshwater (Hot spring)',[127/255, 127/255, 255/255,1]], + [98,6,'Freshwater (Antarctic lake)',[127/255, 127/255, 255/255,1]], + [99,6,'Freshwater (Antarctic lake)',[127/255, 127/255, 255/255,1]], + [100,7,'Freshwater (reclaimed)',[127/255, 127/255, 255/255,1]], + [102,7,'Freshwater (reclaimed)',[127/255, 127/255, 255/255,1]], + [153,8,'Mouse (brain tissue)',[127/255, 127/255, 255/255,1]], + [154,8,'Mouse (brain tissue)',[127/255, 127/255, 255/255,1]], + [202,9,'Fish (gut)',[127/255, 127/255, 255/255,1]], + [206,9,'Fish (gut)',[127/255, 127/255, 255/255,1]], + [209,10,'Mosquito',[127/255, 127/255, 255/255,1]], + [211,10,'Mosquito',[127/255, 127/255, 255/255,1]], + ['U',0,'User input',[255/255, 127/255, 127/255,1]] + ]; +my $DATA_MIC = [ + [17,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [20,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [22,1,'Human (fecal)',[127/255, 127/255, 255/255,1]], + [63,2,'Mouse (fecal)',[127/255, 127/255, 255/255,1]], + [65,2,'Mouse (fecal)',[127/255, 127/255, 255/255,1]], + [68,2,'Mouse (fecal)',[127/255, 127/255, 255/255,1]], + [93,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [95,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [109,4,'Marine (open ocean)',[127/255, 127/255, 255/255,1]], + [110,4,'Marine (open ocean)',[127/255, 127/255, 255/255,1]], + [111,4,'Marine (open ocean)',[127/255, 127/255, 255/255,1]],
+ [120,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [124,5,'Marine (estuary)',[127/255, 127/255, 255/255,1]], + [125,5,'Marine (estuary)',[127/255, 127/255, 255/255,1]], + [134,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [146,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [148,3,'Marine (coastal)',[127/255, 127/255, 255/255,1]], + [201,6,'Fish (gut)',[127/255, 127/255, 255/255,1]], + [203,7,'Fish (slime)',[127/255, 127/255, 255/255,1]], + [205,6,'Fish (gut)',[127/255, 127/255, 255/255,1]], + ['U',0,'User input',[255/255, 127/255, 127/255,1]] + ]; +my $BASE64_BASES = {A => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAAzVJREFUeNrsnMFxo0AQRWe7fJcyMBnYGawyMIe9a0JQJtbefDPOAB33 +JmdgZyBlsIpgl9lCLkwJA/N7uhu0XTXlkstI8Oh+agbG355+/XDC8VaNu8htf1ZjI73DJPx59wCg +EN4phDQkNAsWGqCkIeUM7zFrSL7OBDS+VyObMyQrZWsSUlZnACfw5dwgcZ/5BZPfTEHyEwCvColL +2O24q/uuWUDKJ1TGKpCCsB8Sn4Dl1CGlbvxEBD51SCIlR4lL4VYAUnKB08SzSCSbUkFKLWxRgdMM +sii5wK1BOlksuRSQVoCwA9wjIPDVVCAhWVTWw1SZc0MK8lxHblvUP7fA569TCJyMZFET0qEa75ay +iRtSrDwDlLfG663CPohAQoRdtF4jXrrlFjgZKbU2lN/VeLFSclyQlkAzt6s95BiziVXgXJByFz/7 +WH7x+6OFbOKCFCvL0wUffeUqFYFzQELu7/eVFAKJTeCkmEVDIARXvWqXHAoJEXbwzZ4BZJ/AM21I +iLCLESV50swmMlxqzZ6pnCqkDBD2a0dvlErguRYkiSw6x16zZyKlDy4FwDbjARE4AYBihf1Se0YS +EnRSaSJZpNozxUAKaRv7QNYR/KZSEXgMpI1CFjUhifdMMZBypUzgAB0lcIoAFDv72J6ijY0tuL1P +DckrZ5GrQSM90yYlpMxh9/cfq/GHaSBPq4xeVUBCWWQt/kMaEKNWFQyFJPVAlmRsuCF5N7/wnJCW +TvaBLKkYLHC60iwadWzEWbtzFXgfpNUMhT06CeiKS23wMVKPsNdXAKlX4HTlWTToWG8SQdoxXK3H +zA7E3r0JAr/vmqXogoSu3w87vFeA9AwK3I8pN+Rr/6gAKAQ669m5qoA6hJ0r7mxsoE/Hda4qoA6i +CzDttaJI0TMRc6mFKdqDIqS9w2YtLy4LowTC1o4tdzYR83VaaQASu8Dpwh/ERuzta+441H0am8Cp +1TwuJp5FSQROTB32yRgk9Om4TwI/Q8oc9g9XCmcv2LKJmIRtERL6LfexqoAYSo3r9nUKgb+D7+HP +kFBhW8wi1p6JHL4KujQMCRX4v1UFARJyu2infBky5KIXPYn+rwADAOL8qKxS08x7AAAAAElFTkSu +QmCC', + C => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ 
+bWFnZVJlYWR5ccllPAAAA7BJREFUeNrsnM1xqzAQxxUNDfBKwCWQCt7g+7vgEkgJ5pRDTnYJpgRz +eXeYVBBKCCU8SvAzM6sZxuMPaXclQaydYYKTGPBv/7tagdYvp9NJTO3Px6dwZPl5S2A/hdf3rD9v +1eT1nvuC/r7/vvr7SLizDGAUEzgmNr5nN3mt9ksAWNu6cNuQYoCyhX0bpmANoK4K9tlMWrrw0euH +8/YPPkTsQKkxnIv9nNKSZ79BQb5sy3kNkjnnfMMFzsFiUHNDVZVk9FyDTMguBowvGDS8QTpejDpz +tARAZT4gNRr1zZyswYCSrk84Azuahp58MkAqoR9NkjkG0m7BgG5V76yQcgtD/B6mFqvz9nJlW8Pf +uacdha6zI0P6B6YLbGH6UGv+b3tRbnCNpgdwDpuSOEr9cU61AXXUBOX9YlJWolOVS4MwyxnUs2L6 +cAr2G1MhzAKJKu8K1DMw55UKYFHVlFMhYe//KKuZPH7v+CXxGCyQsNZbBjTNUzURUoyFlFEmhhAK +g3BjVDUVWEg5MV90DgvEy3vgppZi66ScGAKurTJMDxXAvXuPPMLGqUYy7T1A6mBLHxSlRg6MMPLT +hOTLWnBuNVELKS9GD5I2ttDzCalkSOJaiTsmKKkVP8wks4qE4xHNKyRKhd0HSCHcyCPb4LDC9g4p +DqFmL9yGZ4EUkrbhBDeYBSWJoKQAKViAFCAFSLOERKl1kqCkoKSgJFMl9QGSPUijpQHSE6rppypJ +tU5Y7Qig3IL1vZ5ydNJ403BcdzSuZBt71Rp4ncxJSbFHSNmN36melxMAK6iQhgWrSWf9wu6KylBL +byiQCo+hliIcqlTmFFLmaZSjOKfCQFIrNLDmuqUrIULqsHO3muhVl+UAxSl3F3lIDQlSHhMZ9XAQ +w9tKqOlAUs2/lBA4OAgz6jlIkDjUlFsEpTqOqGsXeiokqppUfmqYQy+BY1Lz3sPPJg0O1DPkDXSL +5xV1fjEAanVKHZM7kxtG72ObCjN4L9eAoLUQ36SVqwNFcdQ/GWzTUL6V+7aTn5zhqh0dpl/DUYLE +ueZm6lshhHDbEd4Lg8WnmAcBG7H8dZFGqQMDSfWa9QsG1NmGpOS6XiAoVC+vJMb164JCr8TWe9SH +kwOAqmcO6I1SEEvGON/MEI5KC5QWL9bH3KOaVjNSVQXXQ15XLi14TrW0+1r03kIKYGtrlRYvdM0h +dUPlvMI5WQeTyIFXW/Cqeu5VMIPpheUuTZdfobifjDTTXvxYcz5YXsBxtrD+vwADADoA0kx0ZQr1 +AAAAAElFTkSuQmCC', + G => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAA6lJREFUeNrsnN2RqjAUgANjA9wS3CefsQQsAUtgS8AStAQsQUuQEuR5 +nzYlSAkumTnZy/UKJpyTEANn5syqs0L8cn4hIbjf70zI19eaWZS40aT1Pm80Uvhe1ei59b6Ez8hl +tbr+vl5YhpLCa8xx4h54RqCZhCQsI9OwEkYEr2700OgRXqMlNARn3+gN/kbMrrTPXzS6dA2SHFzO +3BBhyd8wrtEhJTAYV+A8Sg7ji8eCJGbpQmHWhkWM7wrJwxqkCODk7L3kpDvmBWJW3sF6+qyfQRY0 +YknvDqgNKjUByRdAUgqVYK4DKQJ/9gWQ/E0FJSQl6gNExIVdo9tGgw5dw/8cDJw/fhXIA8UGN8cW +ZA9ybPVaQ4vEjHDSapgI/qzBDRXjEBUgAeaj0U8EIAl5Dcepidwux7hbQTRTG3ApTmyRa6LOP+3q +M0OFLybIk1fwQ0pmRjhMQEVgTdkQSHsCQBti6+mzVE5gTVqQMmS6l4BqZkckKGymi3UhYQa8tQio 
+7Xo7gisaSpASZHrdWXCxvrqLI61JqcFNkW52HLmSPmrG0yOA5ezfGw2dxaSI8t9s+GXXjcFMppOp +bj21WgWhoHMyX90tSRCAuAOAZEws4XecdS6LPJOFik9qmq0rsqE6UEic1VyCxExBWiJcrRoh5Y8C +CeNqJfNUKCFVU4GEaUP4DGm2JDQkb63oEVKEyGz1lCCxGZJaMemKiKL2PpJeuiDNme0NLck7SNFU +INUzJLOQ2AzptSxnSLO7kaTyyGdQVJC8drmQsJOPpwJpDt4KkDCXYBPisYmbCgFSuSl3qxHuFk3B +krDWlE0FEiZ4p1OBdEZmuHgKkDjSmrIpQMJaU2Yg0zkJCXtPfz8FSDUSVOwTqL4rk9gtCvnI2Y6s +6e6DRLEg6zRSfBLnvNqAJOST4BwXyxZVMOLtZq8gcUazMOtkIUaJrHozUYKo3C2hWm6cgwtQu5/c +qV2Y6h1VINUMv4C8nfUuoBnyOALOHSzU6GWaQOOBLntmZue2XDLMe4rYpHWVwcbu8XK1uv4uTNXZ +zb1j/z+thkJS1xtj3Tu4W+bxYq22JWEgyZ1APoPaPhbSQ9YCSFC+rbYVE//xLC4OXTAhQR08ASTi +7bqr1AkJDr59YziiUP7zarIplt6cu8zUcTjKu8Gp1idxsCjXg/qB/d1yrzxO6pVuJcyQS6VCBWEh +GNpiBYYfoSiLz/0IYM6gg/rO9qbAwOJzJmVrgd0l3pdEGFXGbUP6EWAA2LwDwtC8jpAAAAAASUVO +RK5CYII=', + T => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAALRJREFUeNrs09ENQDAUQFHEXlhAYgJWMJnEBLqBUWxQFkCC/si5yftq +mzYnaR5jzM4KXXu++J9CNc311YYi022QIEGCBAkSJEiCBCll5c16k+DO4Zj+4dnxmPXj92xvkZYE +SPWLs2uiN/lukCBBggQJkiBBggQJEiRIggQJEiRIkCBBEiRIkCBBggRJkCBBggQJEiRICCBBggQJ +EiRIggQJEiRIkCAJEiRIkCBBggRJ1+0CDAAzsw5U48snWgAAAABJRU5ErkJggg==', + N => 'iVBORw0KGgoAAAANSUhEUgAAAEkAAABJCAYAAABxcwvcAAAACXBIWXMAAAsTAAALEwEAmpwYAAAK +T2lDQ1BQaG90b3Nob3AgSUNDIHByb2ZpbGUAAHjanVNnVFPpFj333vRCS4iAlEtvUhUIIFJCi4AU +kSYqIQkQSoghodkVUcERRUUEG8igiAOOjoCMFVEsDIoK2AfkIaKOg6OIisr74Xuja9a89+bN/rXX +Pues852zzwfACAyWSDNRNYAMqUIeEeCDx8TG4eQuQIEKJHAAEAizZCFz/SMBAPh+PDwrIsAHvgAB +eNMLCADATZvAMByH/w/qQplcAYCEAcB0kThLCIAUAEB6jkKmAEBGAYCdmCZTAKAEAGDLY2LjAFAt +AGAnf+bTAICd+Jl7AQBblCEVAaCRACATZYhEAGg7AKzPVopFAFgwABRmS8Q5ANgtADBJV2ZIALC3 +AMDOEAuyAAgMADBRiIUpAAR7AGDIIyN4AISZABRG8lc88SuuEOcqAAB4mbI8uSQ5RYFbCC1xB1dX +Lh4ozkkXKxQ2YQJhmkAuwnmZGTKBNA/g88wAAKCRFRHgg/P9eM4Ors7ONo62Dl8t6r8G/yJiYuP+ +5c+rcEAAAOF0ftH+LC+zGoA7BoBt/qIl7gRoXgugdfeLZrIPQLUAoOnaV/Nw+H48PEWhkLnZ2eXk +5NhKxEJbYcpXff5nwl/AV/1s+X48/Pf14L7iJIEyXYFHBPjgwsz0TKUcz5IJhGLc5o9H/LcL//wd 
+0yLESWK5WCoU41EScY5EmozzMqUiiUKSKcUl0v9k4t8s+wM+3zUAsGo+AXuRLahdYwP2SycQWHTA +4vcAAPK7b8HUKAgDgGiD4c93/+8//UegJQCAZkmScQAAXkQkLlTKsz/HCAAARKCBKrBBG/TBGCzA +BhzBBdzBC/xgNoRCJMTCQhBCCmSAHHJgKayCQiiGzbAdKmAv1EAdNMBRaIaTcA4uwlW4Dj1wD/ph +CJ7BKLyBCQRByAgTYSHaiAFiilgjjggXmYX4IcFIBBKLJCDJiBRRIkuRNUgxUopUIFVIHfI9cgI5 +h1xGupE7yAAygvyGvEcxlIGyUT3UDLVDuag3GoRGogvQZHQxmo8WoJvQcrQaPYw2oefQq2gP2o8+ +Q8cwwOgYBzPEbDAuxsNCsTgsCZNjy7EirAyrxhqwVqwDu4n1Y8+xdwQSgUXACTYEd0IgYR5BSFhM +WE7YSKggHCQ0EdoJNwkDhFHCJyKTqEu0JroR+cQYYjIxh1hILCPWEo8TLxB7iEPENyQSiUMyJ7mQ +AkmxpFTSEtJG0m5SI+ksqZs0SBojk8naZGuyBzmULCAryIXkneTD5DPkG+Qh8lsKnWJAcaT4U+Io +UspqShnlEOU05QZlmDJBVaOaUt2ooVQRNY9aQq2htlKvUYeoEzR1mjnNgxZJS6WtopXTGmgXaPdp +r+h0uhHdlR5Ol9BX0svpR+iX6AP0dwwNhhWDx4hnKBmbGAcYZxl3GK+YTKYZ04sZx1QwNzHrmOeZ +D5lvVVgqtip8FZHKCpVKlSaVGyovVKmqpqreqgtV81XLVI+pXlN9rkZVM1PjqQnUlqtVqp1Q61Mb +U2epO6iHqmeob1Q/pH5Z/YkGWcNMw09DpFGgsV/jvMYgC2MZs3gsIWsNq4Z1gTXEJrHN2Xx2KruY +/R27iz2qqaE5QzNKM1ezUvOUZj8H45hx+Jx0TgnnKKeX836K3hTvKeIpG6Y0TLkxZVxrqpaXllir +SKtRq0frvTau7aedpr1Fu1n7gQ5Bx0onXCdHZ4/OBZ3nU9lT3acKpxZNPTr1ri6qa6UbobtEd79u +p+6Ynr5egJ5Mb6feeb3n+hx9L/1U/W36p/VHDFgGswwkBtsMzhg8xTVxbzwdL8fb8VFDXcNAQ6Vh +lWGX4YSRudE8o9VGjUYPjGnGXOMk423GbcajJgYmISZLTepN7ppSTbmmKaY7TDtMx83MzaLN1pk1 +mz0x1zLnm+eb15vft2BaeFostqi2uGVJsuRaplnutrxuhVo5WaVYVVpds0atna0l1rutu6cRp7lO +k06rntZnw7Dxtsm2qbcZsOXYBtuutm22fWFnYhdnt8Wuw+6TvZN9un2N/T0HDYfZDqsdWh1+c7Ry +FDpWOt6azpzuP33F9JbpL2dYzxDP2DPjthPLKcRpnVOb00dnF2e5c4PziIuJS4LLLpc+Lpsbxt3I +veRKdPVxXeF60vWdm7Obwu2o26/uNu5p7ofcn8w0nymeWTNz0MPIQ+BR5dE/C5+VMGvfrH5PQ0+B +Z7XnIy9jL5FXrdewt6V3qvdh7xc+9j5yn+M+4zw33jLeWV/MN8C3yLfLT8Nvnl+F30N/I/9k/3r/ +0QCngCUBZwOJgUGBWwL7+Hp8Ib+OPzrbZfay2e1BjKC5QRVBj4KtguXBrSFoyOyQrSH355jOkc5p +DoVQfujW0Adh5mGLw34MJ4WHhVeGP45wiFga0TGXNXfR3ENz30T6RJZE3ptnMU85ry1KNSo+qi5q +PNo3ujS6P8YuZlnM1VidWElsSxw5LiquNm5svt/87fOH4p3iC+N7F5gvyF1weaHOwvSFpxapLhIs +OpZATIhOOJTwQRAqqBaMJfITdyWOCnnCHcJnIi/RNtGI2ENcKh5O8kgqTXqS7JG8NXkkxTOlLOW5 
+hCepkLxMDUzdmzqeFpp2IG0yPTq9MYOSkZBxQqohTZO2Z+pn5mZ2y6xlhbL+xW6Lty8elQfJa7OQ +rAVZLQq2QqboVFoo1yoHsmdlV2a/zYnKOZarnivN7cyzytuQN5zvn//tEsIS4ZK2pYZLVy0dWOa9 +rGo5sjxxedsK4xUFK4ZWBqw8uIq2Km3VT6vtV5eufr0mek1rgV7ByoLBtQFr6wtVCuWFfevc1+1d +T1gvWd+1YfqGnRs+FYmKrhTbF5cVf9go3HjlG4dvyr+Z3JS0qavEuWTPZtJm6ebeLZ5bDpaql+aX +Dm4N2dq0Dd9WtO319kXbL5fNKNu7g7ZDuaO/PLi8ZafJzs07P1SkVPRU+lQ27tLdtWHX+G7R7ht7 +vPY07NXbW7z3/T7JvttVAVVN1WbVZftJ+7P3P66Jqun4lvttXa1ObXHtxwPSA/0HIw6217nU1R3S +PVRSj9Yr60cOxx++/p3vdy0NNg1VjZzG4iNwRHnk6fcJ3/ceDTradox7rOEH0x92HWcdL2pCmvKa +RptTmvtbYlu6T8w+0dbq3nr8R9sfD5w0PFl5SvNUyWna6YLTk2fyz4ydlZ19fi753GDborZ752PO +32oPb++6EHTh0kX/i+c7vDvOXPK4dPKy2+UTV7hXmq86X23qdOo8/pPTT8e7nLuarrlca7nuer21 +e2b36RueN87d9L158Rb/1tWeOT3dvfN6b/fF9/XfFt1+cif9zsu72Xcn7q28T7xf9EDtQdlD3YfV +P1v+3Njv3H9qwHeg89HcR/cGhYPP/pH1jw9DBY+Zj8uGDYbrnjg+OTniP3L96fynQ89kzyaeF/6i +/suuFxYvfvjV69fO0ZjRoZfyl5O/bXyl/erA6xmv28bCxh6+yXgzMV70VvvtwXfcdx3vo98PT+R8 +IH8o/2j5sfVT0Kf7kxmTk/8EA5jz/GMzLdsAAAAgY0hSTQAAeiUAAICDAAD5/wAAgOkAAHUwAADq +YAAAOpgAABdvkl/FRgAAAh1JREFUeNrsmk1xwzAQRr8RgYRBwqBhkDJoGbQMagZ1GbgMVAYNA5dB +wsBm4CBwL9Wx0Uwk7593Z3z0SHmRn3fXi3me8d8FoAUw33kdQB/9PXu9xWCeZ4QFN9zBSCwJ6Qig +cUj5aAFsHdLt2Fh47ALBGi8AHh2ScYlTQXrQLPFAuJZaiVNC2gCIDikfTxolHhjWjA4pH7s/Pzmk +TDQA9g7JUCYeGNdWI/HAvH50SEYkHgTs4V26xIOQfUSHlI8jgGeHlI9OagEsCdIOQtspQdh+REo8 +CPzjokNSKPGlIJ0qnKatdUgdgJ/CArhdw+NW+qZ6A888ASmkM4DPCifSvLhbANdCib9ahzRV+JHs +mThFCvCtXeJUeVLpaWKVOBWkAcCH1kycMuPuAIwF97PNE1BCqiHxlkPi1LVbX1iysHyK4ihwm8Lc +iXwojAPSUOE0dNYhJbdctEics5/UVAC9tQ6pB/BVKPFoHVINiZPME3BDmirUZdE6pPSmKimAF58n +kPIhoKlw/946pDPKupiLZuKSPim1FSR+sA6pRgG8sQ4JKO9iYg2QAAGNfw2QBpR3Mc1DSrnT6JCW +l7h5SKkAPjmk5QvgVUAaIGAeQDqklImPDkl47qQFUo+yLuYqILFKXBOkCUzTJZogpUz84pAESlwj +pDPKZzHNQ0q509Uh5SXeOKR8RBB1MTVDIpO4dkgDCLqY2iGl3Gl0SMwS/x0AsYSfWCRqIfIAAAAA +SUVORK5CYII=' + }; +my $MMCHART_B2 = 'iVBORw0KGgoAAAANSUhEUgAAAAEAAAAGCAYAAAACEPQxAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAAABdJREFUeNpiYGBg+M/w//9/BmwEQIABANxBD/HRDNRSAAAAAElFTkSu +QmCC'; +my 
$FREQCHART_L = 'iVBORw0KGgoAAAANSUhEUgAAAC8AAABvCAIAAADzHQ6XAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJ +bWFnZVJlYWR5ccllPAAABWNJREFUeNrkm91V7DoMhTOsaQBKgBKgBCjhUgKUAI/wBiUwJUAJUAKU +ACVACed+K/tcYxz5J5nYk3uOHmYFxokVWdrakj17v1rJ6+trdkw3y0x3d3dd172/v5vfPj8/d/8J +Iytq488U0+bw8PDq6oqLh4cHhn1+fta1jXQytWGB+ErLhB5cPz4+xp6z11WWr68vPvf39/WJnT4+ +PmKD17W10dwoYX57c3Oji9vbW2xTXRvpgYVknoQ2fFZfKSkhC6ETFzE7tdDm+PiY6V9eXrh+enri +8/T0NDq6RkxxwZ/EczneSJNuIVgsbVa62rmsVqsfeHN5eanV3aU4W8n5cTqWNobx9ST0G2AbPZzD +c+HcsJk2ht8ACYTiSy8Y7J9eUmG5hQRYbMQUy+SMhDasnUyVyL1NV4rgRCcs1E6btBdDUPiqnTYX +FxeA5q8dyZxYXMJ5W2jDmrqcHHMsTO7GxFDD0OauF/cI1i4bR4yRPylTmixTMOHIvPlMafMjM1xf +X/tgAOqcnZ2JSsaEMcwkVoVab29vJhlVqOozwUQ7n0sHb8br8lqZmsMzPnZivuEYgWc6MKXJ2mev +AeAe9pKwTZrzOkHdo14Yr9ceYvEP24gQDUODmbK2cRY1bYOX8BDZj1kwtpmSQy8+7oXbPnvhTiwc +u9n3UKcut6uK80UVXfb1Qm0wTGDzrNP49ojFlJZGz9EKmEFu4w1Dr3rh/pI0aeJNwIt5lCtfhsZb +LhYvixev/XC9v78fwpdPM2rLtzbALgqJW+2YpRMLXO+cUayDKrWxLWwsFnZl0aWdbYAEcjiuE7Q2 +wLdsJprZb9QDMyXRGKtYT8VytdkFql6H7/eCThR1akVJSp5lkqyt6nAWSyxO2YQFwq+zkF/Cixnj +orU0T4k/EFYiBjwCCOY/6dxZwot5DmN4jskwDW38cY6mcH+WVGT5TbZHbLB0k4mqvZtm6X6z02Tp +6gltNptVL6TCfJ5ynUsfjtXCTDhyCS/WyzCGV0cnKhOmcK+dwmJEzovBHRNN+02WFysDuofElt5g +okGe4s7semf9Rh7p8nGpNtOYaJYXi5lrjBh7Uc0wYx0e8GIfb0bU4QvixScnJ2ZQtGGiYZ4ys08a +bCr2i83dSnyiavNxnN80I4QGL06AaRte/O3FQStJfkOWAUIc06jtxSkm6naOF8FE23BQ3zZL3Z/a +Rubixd8xFcNiJ/gybjQkUxTwDrWHA3wydHR0RJ4i0eb3fMWtxmIx7MmdkSAqE+PPz89HYPGwI6ee +VAKLy89ICNZLc3jA0AqxONh4jnUY1XfWFGlt1sHSDtsU2/eLWUEVOiN6FGrN8wbq0GJ/uVuiqSPb +OIuatsEYSr0BBRvdoc0eJSrxm2EETeTFJYzO2SN77micbbY/KeVe2px4ijbcg7XH7s7Pv3c3rUcx +bw4Pz1GoR6GiGpDI7pZV4cWTexRVbDO5R1Gllz6tR1G3XzyhR1GrgpnWo6h74mWz2ciHUAVQb0aN +F8yLcVgdxtstL86wrWl5akj8Rp/3U0CN1SZ7dli4qmSpMWa2Cb2YzMBiDelIYg+GBaLSUDCy1gcH +B8MymTF+qOIfZtmQr3yzezCFvNhkZyleLJQbuwdTyItdoGB+LOebKorFk6lWmhf7NNc8gWIzinpn +h3EdmAmq+AEYPf/rVwVjgzzLi0VLsp2X3xHu++kE1MnyYh9s8jXDltrMy4vn6ZjMtZG+1y1JlqXN +KrBzAOSNO/vrYO7GRymi/eI/pwv5Z3rx36pNEXduUCst63cwJb+7a6RNYU95zqyZ3Wz7n/3uTgDY 
+tXHhEu7cYqX++t/dbY83vyOr2WnHEu78rwADABaBbeIZChwYAAAAAElFTkSuQmCC'; + +my $CSS_STYLE = ' +html, body, div, span, p, img { + margin: 0; + padding: 0; + border: 0; + outline: 0; + font-size: 100%; + vertical-align: baseline; + background: transparent; +} + +html, body { + font-family: Arial, Verdana; + color: #40454b; + font-size: 12px; + text-align: center; +} + +img { + padding: 0px; margin: 0px; border: none; +} + +.info-panel { + margin-top: 10px; + margin-bottom: 10px; + width: 740px; + text-align: left; +} + +.info-header { + padding-top: 20px; + padding-top: 10px; +} + +.info-header-title { + color: #126499; + text-decoration: none; + font-family: sans-serif; + font-weight: bold; + font-size: 16px; + vertical-align: baseline; + margin-right: 20px; + margin-bottom: 25px; + margin-top: 15px; +} + +.info-content { + padding: 2px; + font-family: "lucida grande",sans-serif,arial; + margin-top: 15px; + margin-bottom: 15px; +} + +.info-table-type { + min-width: 70px; + padding: 4px; + vertical-align: top; +} + +.info-table-value { + font-weight: bold; + padding-top: 4px; + padding-left: 10px; + padding-right: 10px; + vertical-align: top; +} + +hr { + background-color: #E0E0E0; + border: medium none; + color: #E0E0E0; + height: 1px; + outline: medium none; +} + +.sequencetext { + font-family: courier, "courier new"; + font-weight: normal; +} +'; + +my $VERSION = '0.6'; +my $WHAT = 'graphs'; + +my $man = 0; +my $help = 0; +my %params = ('help' => \$help, 'h' => \$help, 'man' => \$man); +GetOptions( \%params, + 'help|h', + 'man', + 'verbose', + 'version' => sub { print "PRINSEQ-$WHAT $VERSION\n"; exit; }, + 'i=s', + 'o=s', + 'png_all', + 'html_all', + 'log:s', + 'web:s' + ) or pod2usage(2); +pod2usage(1) if $help; +pod2usage(-exitstatus => 0, -verbose => 2) if $man; + +=head1 NAME + +PRINSEQ - PReprocessing and INformation of SEQuence data + +=head1 VERSION + +PRINSEQ-graphs 0.6 + +=head1 SYNOPSIS + +perl prinseq-graphs.pl [-h] [-help] [-version] [-man] [-verbose] [-i 
input_graph_data_file] [-png_all] [-html_all] [-log file] + +=head1 DESCRIPTION + +PRINSEQ will help you to preprocess your genomic or metagenomic sequence data in FASTA (and QUAL) or FASTQ format. The graphs version allows users of the lite version to generate graphs similar to the web version. + +=head1 OPTIONS + +=over 8 + +=item B<-help> | B<-h> + +Print the help message; ignore other arguments. + +=item B<-man> + +Print the full documentation; ignore other arguments. + +=item B<-version> + +Print program version; ignore other arguments. + +=item B<-verbose> + +Prints status and info messages during processing. + +=item B<***** INPUT OPTIONS *****> + +=item B<-i> + +Input file containing the graph data generated by the lite version. + +=item B<***** OUTPUT OPTIONS *****> + +=item B<-o> + +By default, the output files are created in the same directory as the input file with an additional "_prinseq_graphs_XXXX" in their name (where XXXX is replaced by random characters to prevent overwriting previous files). To change the output filename and location, specify the filename using this option. The file extension will be added automatically. + +=item B<-png_all> + +Use this option to generate PNG files with the graphs. + +=item B<-html_all> + +Use this option to generate a HTML file with the graphs and tables. + +=item B<-log> + +Log file to keep track of parameters, errors, etc. The log file name is optional. If no file name is given, the log file name will be "inputname.log". If the log file already exists, new content will be added to the file. + +=back + +=head1 AUTHOR + +Robert SCHMIEDER, C<< >> + +=head1 BUGS + +If you find a bug please email me at C<< >> or use http://sourceforge.net/tracker/?group_id=315449 so that I can make PRINSEQ better. 
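+
+=head1 EXAMPLES
+
+The following invocations are an illustrative sketch assembled from the options documented above. The file names (example.gd, out/example) are placeholders and are not shipped with PRINSEQ.
+
+Write PNG graphs using a chosen output prefix. Each graph file gets its own suffix and extension appended to the prefix, for example out/example_ld.png for the length distribution (assuming the graph data contains length counts):
+
+  perl prinseq-graphs.pl -i example.gd -png_all -o out/example
+
+Write both the PNG graphs and the HTML report next to the input file, and append status messages to a log file with the default name:
+
+  perl prinseq-graphs.pl -i example.gd -png_all -html_all -log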
+ +=head1 COPYRIGHT + +Copyright (C) 2011-2012 Robert SCHMIEDER + +=head1 LICENSE + +This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. + +You should have received a copy of the GNU General Public License along with this program. If not, see . + +=cut + +# +################################################################################ +## DATA AND PARAMETER CHECKING +################################################################################ +# + +my ($file1,$command,@dataread); + +#Check if input file exists and check if file format is correct +if(exists $params{i}) { + $command .= ' -i '.$params{i}; + $file1 = $params{i}; + if($params{i} eq 'stdin') { + my $format = &checkInputFormat(); + unless($format eq 'gd') { + &printError('input data for -i is in '.uc($format).' format not in graph data format'); + } + } elsif(-e $params{i}) { + #check for file format + my $format = &checkFileFormat($file1); + unless($format eq 'gd') { + &printError('input file for -i is in '.uc($format).' format not in graph data format'); + } + } else { + &printError("could not find input file \"".$params{i}."\""); + } +} else { + &printError("you did not specify an input file containing the graph data"); +} + +#check output file name prefix +if(exists $params{o}) { + $command .= ' -o '.$params{o}; +} + +#check for output format +unless(exists $params{png_all} || exists $params{html_all}) { + &printError("No output format specified. 
Use -png_all and/or -html_all to generate graphs."); +} +if(exists $params{png_all}) { + $command .= ' -png_all'; +} +if(exists $params{html_all}) { + $command .= ' -html_all'; +} +if(exists $params{web}) { + $command .= ' -web'.($params{web} ? ' '.$params{web} : ''); +} + +#add remaining to log command +if(exists $params{log}) { + $command .= ' -log'.($params{log} ? ' '.$params{log} : ''); + + unless($params{log}) { + $params{log} = join("__",$file1||'nonamegiven').'.log'; + } + $params{log} = cwd().'/'.$params{log} unless($params{log} =~ /^\//); + &printLog("Executing PRINSEQ with command: \"perl prinseq-".$WHAT.".pl".$command."\""); +} + +# +################################################################################ +## DATA PROCESSING +################################################################################ +# + +my $filename = $file1; +while($filename =~ /[\w\d]+\.[\w\d]+$/) { + $filename =~ s/\.[\w\d]+$//; + last if($filename =~ /\/[^\.]+$/); +} + +if(exists $params{png_all}) { + my $graphs = &generateGraphs($params{i},$params{o}); + if(exists $params{web} && $params{web} ne 'nozip') { + #png files + if(scalar(@$graphs)) { + system("zip -j -r ".dirname($params{o})."/png_graphs.zip ".dirname($params{o}).' -i \*.png') == 0 or &printError("Cannot generate graphs ZIP file"); + } + } +} +if(exists $params{html_all}) { + &generateHtml($params{i},$params{o}); +} + +&printWeb("STATUS: done"); + +## +################################################################################# +### MISC FUNCTIONS +################################################################################# +## + +sub printError { + my $msg = shift; + print STDERR "\nERROR: ".$msg.".\n\nTry \'perl prinseq-".$WHAT.".pl -h\' for more information.\nExit program.\n"; + &printLog("ERROR: ".$msg.". 
Exit program.\n"); + exit(0); +} + +sub printWarning { + my $msg = shift; + print STDERR "WARNING: ".$msg.".\n"; + &printLog("WARNING: ".$msg.".\n"); +} + +sub printWeb { + my $msg = shift; + if(exists $params{web}) { + print STDERR "\n".&getTime()."$msg\n"; + } +} + +sub getTime { + return sprintf("[%02d/%02d/%04d %02d:%02d:%02d] ",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); +} + +sub printLog { + my $msg = shift; + if(exists $params{log}) { + my $time = sprintf("%02d/%02d/%04d %02d:%02d:%02d",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); + open(FH, ">>", $params{log}) or die "ERROR: Can't open file ".$params{log}.": $! \n"; + flock(FH, LOCK_EX) or die "ERROR: Cannot lock file ".$params{log}.": $! \n"; + print FH "[prinseq-".$WHAT."-$VERSION] [$time] $msg\n"; + flock(FH, LOCK_UN) or die "ERROR: cannot unlock ".$params{log}.": $! \n"; + close(FH); + } +} + +sub addCommas { + my $num = shift; + return unless(defined $num); + return $num if($num < 1000); + $num = scalar reverse $num; + $num =~ s/(\d{3})/$1\,/g; + $num =~ s/\,$//; + $num = scalar reverse $num; + return $num; +} + +sub checkFileFormat { + my $file = shift; + + my ($format,$count,$id,$fasta,$fastq,$qual,$gd,$aa); + $count = 3; + $fasta = $fastq = $qual = $gd = $aa = 0; + $format = 'unknown'; + + open(FILE,"perl -p -e 's/\r/\n/g;s/\n\n/\n/g' < $file |") or die "ERROR: Could not open file $file: $! 
\n"; + while (<FILE>) { +# chomp(); + # next unless(length($_)); + if($count-- == 0) { + last; + } elsif(!$fasta && /^\>\S+\s*/) { + $fasta = 1; + $qual = 1; + } elsif($fasta == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*-]+/) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/))) { + $fasta = 2; + } elsif($qual == 1 && /^\s*\d+/) { + $qual = 2; + } elsif(!$fastq && /^\@(\S+)\s*/) { + $id = $1; + $fastq = 1; + } elsif($fastq == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*-]+/) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/))) { + $fastq = 2; + } elsif($fastq == 2 && /^\+(\S*)\s*/) { + $fastq = 3 if($id eq $1 || /^\+\s*$/); + } elsif(!$gd && /^\{\"numseqs\"\:/) { + $gd = 1; + } + } + close(FILE); + if($fasta == 2) { + $format = 'fasta'; + } elsif($qual == 2) { + $format = 'qual'; + } elsif($fastq == 3) { + $format = 'fastq'; + } elsif($gd == 1) { + $format = 'gd'; + } + + return $format; +} + +sub checkInputFormat { + my ($format,$count,$id,$fasta,$fastq,$qual,$gd,$aa); + $count = 3; + $fasta = $fastq = $qual = $gd = $aa = 0; + $format = 'unknown'; + + while (<STDIN>) { + push(@dataread,$_); +# chomp(); + # next unless(length($_)); + if($count-- == 0) { + last; + } elsif(!$fasta && /^\>\S+\s*/) { + $fasta = 1; + $qual = 1; + } elsif($fasta == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*-]+/) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/))) { + $fasta = 2; + } elsif($qual == 1 && /^\s*\d+/) { + $qual = 2; + } elsif(!$fastq && /^\@(\S+)\s*/) { + $id = $1; + $fastq = 1; + } elsif($fastq == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*-]+/) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/))) { + $fastq = 2; + } elsif($fastq == 2 && /^\+(\S*)\s*/) { + $fastq = 3 if($id eq $1 || /^\+\s*$/); + } elsif(!$gd && /^\{\"numseqs\"\:/) { + $gd = 1; + } + } + + if($fasta == 2) { + $format = 'fasta'; + } elsif($qual == 2) { + $format = 'qual'; + } elsif($fastq == 3) { + $format
= 'fastq'; + } elsif($gd == 1) { + $format = 'gd'; + } + + return $format; +} + +sub readGdFile { + my $file = shift; + my $data; + + open(DATA,"<$file") or &printError("Could not open file $file: $!"); + while(<DATA>) { + next if(/^\#/); + chomp(); + if(length($_)) { + $data = from_json($_); + } + } + close(DATA); + + return $data; +} + +sub getFileName { + my $ext = shift; + my ($file,$fh); + if(exists $params{o}) { + $file = $params{o}.$ext; + open(OUT,">$file") or &printError('cannot open output file'); + close(OUT); + } else { + $fh = File::Temp->new( TEMPLATE => $filename.'_prinseq_graphs_XXXX', + SUFFIX => $ext, + UNLINK => 0); + $file = $fh->filename; + $fh->close(); + } + return $file; +} + +sub generateGraphs { + my ($in,$out) = @_; + my ($file,$data,$surface,@graphs); + $data = &readGdFile($in); + + #length plot + if(exists $data->{counts}->{length}) { + $file = &getFileName('_ld.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{length},1),$data->{stats}->{length},'Length Distribution','Read Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{counts2} && exists $data->{counts2}->{length}) { + $file = &getFileName('_ld-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{length},1),$data->{stats2}->{length},'Length Distribution','Read Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #tail plot + if(exists $data->{tail}) { + $file = &getFileName('_td5.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{tail5},1),undef,'Poly-A/T Tail Distribution (> 4bp)','5\' Tail Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + $file = &getFileName('_td3.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{tail3},1),undef,'Poly-A/T Tail Distribution (> 4bp)','3\' Tail Length in bp','#
Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{tail2}) { + $file = &getFileName('_td5-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{tail5},1),undef,'Poly-A/T Tail Distribution (> 4bp)','5\' Tail Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + $file = &getFileName('_td3-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{tail3},1),undef,'Poly-A/T Tail Distribution (> 4bp)','3\' Tail Length in bp','# Sequences',$file,0,' bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Ns plot + if(exists $data->{counts}->{ns}) { + $file = &getFileName('_ns.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{ns},1),undef,'Percentage of N\'s (> 0%)','Percentage of N\'s per Read (1-100%)','# Sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{counts2} && exists $data->{counts2}->{ns}) { + $file = &getFileName('_ns-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{ns},1),undef,'Percentage of N\'s (> 0%)','Percentage of N\'s per Read (1-100%)','# Sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #GC content plot + if(exists $data->{counts}->{gc}) { + $file = &getFileName('_gc.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{gc},0),$data->{stats}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences',$file,1); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{counts2} && exists $data->{counts2}->{gc}) { + $file = &getFileName('_gc-2.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{gc},0),$data->{stats2}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences',$file,1); + $surface->write_to_png($file); + 
push(@graphs,$file); + } + + #Sequence complexity plot - dust + if(exists $data->{compldust}) { + $file = &getFileName('_cd.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{compldust},0),undef,'Sequence complexity distribution','Mean sequence complexity (DUST scores)','Number of sequences',$file,1); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Sequence complexity plot - entropy + if(exists $data->{complentropy}) { + $file = &getFileName('_ce.png'); + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{complentropy},0),undef,'Sequence complexity distribution','Mean sequence complexity (Entropy values)','Number of sequences',$file,1); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Dinucleotide odd ratio PCA plot - microbial/viral + #Odds ratio plot + if(exists $data->{dinucodds}) { + my @new = map {$data->{dinucodds}->{$_}} sort keys %{$data->{dinucodds}}; + $file = &getFileName('_pm.png'); + $surface = &createPCAPlot(&convertToPCAValues(\@new,'m'),'PCA','1st Principal Component Score','2nd Principal Component Score',$file); + $surface->write_to_png($file); + push(@graphs,$file); + $file = &getFileName('_pv.png'); + $surface = &createPCAPlot(&convertToPCAValues(\@new,'v'),'PCA','1st Principal Component Score','2nd Principal Component Score',$file); + $surface->write_to_png($file); + push(@graphs,$file); + $file = &getFileName('_or.png'); + $surface = &createOddsRatioPlot($data->{dinucodds},'Odds ratios','Dinucleotide','Odds ratio',$file); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Qual plot + if(exists $data->{quals}) { + $file = &getFileName('_qd.png'); + $surface = &createBoxPlot(&convertToBoxValues($data->{quals},4),'Base Quality Distribution','Read position in %','Quality score',$file); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{quals2}) { + $file = &getFileName('_qd-2.png'); + $surface = 
&createBoxPlot(&convertToBoxValues($data->{quals2},4),'Base Quality Distribution','Read position in %','Quality score',$file); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Qualbin plot + if(exists $data->{qualsbin}) { + $file = &getFileName('_qd2.png'); + $surface = &createBoxPlot(&convertToBoxValues($data->{qualsbin},4),'Base Quality Distribution','Read position in bp','Quality score',$file,0,'bp',$data->{binval}); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{qualsbin2}) { + $file = &getFileName('_qd2-2.png'); + $surface = &createBoxPlot(&convertToBoxValues($data->{qualsbin2},4),'Base Quality Distribution','Read position in bp','Quality score',$file,0,'bp',$data->{binval}); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Qualmean plot + if(exists $data->{qualsmean}) { + $file = &getFileName('_qd3.png'); + $surface = &createBarPlot(&convertToBarValues($data->{qualsmean},5,1),'Sequence Quality Distribution','Mean of quality scores per sequence','Number of sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{qualsmean2}) { + $file = &getFileName('_qd3-2.png'); + $surface = &createBarPlot(&convertToBarValues($data->{qualsmean2},5,1),'Sequence Quality Distribution','Mean of quality scores per sequence','Number of sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + + #Sequence duplicate plots + if(exists $data->{dubscounts}) { + $file = &getFileName('_df.png'); + $surface = &createStackBarPlot(&convertOdToStackBinMatrix($data->{dubscounts},5,1,100),'Sequence duplication level','Number of duplicates','Number of sequences',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{dubslength}) { + $file = &getFileName('_dl.png'); + $surface = &createStackBarPlot(&convertOdToStackBinMatrix($data->{dubslength},5,1),'Sequence duplication level','Read Length in bp','Number of duplicates',$file,0,' 
bp'); + $surface->write_to_png($file); + push(@graphs,$file); + } + if(exists $data->{dubscounts}) { + my %dubsmax; + my $count = 1; + foreach my $n (sort {$b <=> $a} keys %{$data->{dubscounts}}) { + foreach my $s (keys %{$data->{dubscounts}->{$n}}) { + foreach my $i (1..$data->{dubscounts}->{$n}->{$s}) { + $dubsmax{$count++}->{$s} = $n; + last unless($count <= 100); + } + last unless($count <= 100); + } + last unless($count <= 100); + } + $file = &getFileName('_dm.png'); + $surface = &createStackBarPlot(&convertOdToStackBinMatrix(\%dubsmax,5,1,100),'Sequence duplication level','Sequence','Number of duplicates',$file,0); + $surface->write_to_png($file); + push(@graphs,$file); + } + + return \@graphs; +} + +sub convertOdToBinMatrix { + my ($data,$min,$max,$nonice) = @_; + + my ($num,$ymax,$xmax,$xmin,$step,%vals,$tmp,@matrix,$bin,$tmpbin); + + #make nice xmax value + if(defined $max) { + $xmax = $max; + } else { + $xmax = (sort {$b <=> $a} keys %$data)[0]; + } + $bin = &getBinVal($xmax); + $xmax = $bin*100; + $xmin = (defined $min ? $min : 0); + + #get data to bin and find y axis max value + $ymax = 0; + $tmp = 0; + $tmpbin = $bin; + foreach my $i ($xmin..$xmax) { + if(exists $data->{$i}) { + $tmp += $data->{$i}; + } + if(--$tmpbin <= 0) { + $tmpbin = $bin; + $ymax = &max($ymax,$tmp); + push(@matrix,$tmp); + $tmp = 0; + } + } + + #make nice ymax value + unless($nonice) { + $ymax = sprintf("%d",($ymax/4)+1)*4 if($ymax % 4); +# $step = ($ymax <= 10 ? 10 : ($ymax < 40 ? 40 : ($ymax < 100 ? 100 : ($ymax < 1000 ? 100 : 100)))); +# $ymax = sprintf("%d",($ymax/$step)+1)*$step if($ymax % $step); + } + + return (\@matrix,$xmax,$ymax); +} + +sub getBinVal { + my $val = shift; + my $step; + if(!$val || $val <= 100) { + return 1; + } elsif($val < 10000) { + return int($val/100)+($val % 100 ? 1 : 0); + } elsif($val < 100000) { + return 1000; + } else { + $step = 1000000; + my $xmax = ($val % $step ? 
sprintf("%d",($val/$step+1))*$step : $val); + return ($xmax/100); + } +} + +sub max { + my ($a,$b) = @_; + return ($a < $b ? $b : $a); +} + +sub min { + my ($a,$b) = @_; + return ($a > $b ? $b : $a); +} + +sub createAnnotBarPlot { + my ($matrix,$xmax,$ymax,$annot,$title,$xlab,$ylab,$file,$zero,$add) = @_; + + my $bin = 1; + if($xmax > 100) { + $bin = $xmax / 100; + $xmax = 100; + } + + my @barcol = (127/255, 127/255, 255/255, 1); #b2b2ff + my @meancol = (255/255, 127/255, 127/255, 1); #ffb2b2 + my @stdcol = (178/255, 178/255, 255/255, 0.8); #7f7fff + my @std1col = (0, 0, 0, 0.04); #ff7f7f + my @std2col = (0, 0, 0, 0.03); #ff7f7f + my @linecol = (0, 0, 0, 0.4); + my @helplinecol = (1, 1, 1, 0.9); + my @background = (0.95, 0.95, 0.95, 1); + my @tickcol = (0, 0, 0, 0.8); + my @labelcol = (0, 0, 0, 1); + + #create new image + my $size = 6; + my $offset = 20; + my $left = 40; + my $bottom = 15; + my $top = 20; + my $height = 200; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+($xmax+$zero)*$size,$bottom+$top+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+2*200+20); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); + + $cr->save; + + #set up work space + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, ($xmax+$zero)*$size-1, $height); + $cr->set_source_rgba(@background); + $cr->fill; + + #draw ticks + #x-axis + $cr->set_source_rgba(@tickcol); + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%5) == 0 && $i > 1 && $i < $xmax) { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 
0 : $size)-1, $top+$offset+$height+3); + } else { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+1); + } + } + $cr->stroke; + + #y-axis + $cr->move_to($left+$offset, $top+$offset); + $cr->line_to($left+$offset-3, $top+$offset); + $cr->move_to($left+$offset, $top+$offset+$height-1); + $cr->line_to($left+$offset-3, $top+$offset+$height-1); + $cr->stroke; + + #helplines + $cr->set_source_rgba(@helplinecol); + foreach my $j (1..3) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 1 : 0)); + } + $cr->stroke; + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(@tickcol); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%10) == 0 && $i > 1 && $i < $xmax) { + $extents = $cr->text_extents($i*$bin); + $cr->move_to($left+$offset+int($size/2+1)+$size*$i-($zero ? 0 : $size)-$extents->{width}/2-1-($i == 1 ? 1 : 0), $top+$offset+$height+$fontheight+2); + $cr->show_text($i*$bin); + } + } + #y-axis + $extents = $cr->text_extents(&addCommas($ymax)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2); + $cr->show_text(&addCommas($ymax)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2+$height); + $cr->show_text(0); + + $cr->save; + + #labels + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #axis labels + $cr->set_source_rgba(@labelcol); + $extents = $cr->text_extents($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? 
$add.')' : '') : '')); + $cr->move_to($left+$offset+($xmax+$zero)*$size/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? $add.')' : '') : '')); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab.($bin>1 ? ' (per bin)' : '')); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2),$offset+10); + $cr->show_text($ylab.($bin>1 ? ' (per bin)' : '')); + + $cr->restore; + + #draw annotations + if($annot) { + $cr->set_antialias('none'); + my ($std1l,$std2l,$std1r,$std2r); + #std boxes + $std1l = int($annot->{mean})-int($annot->{std}); + $std2l = int($annot->{mean})-2*int($annot->{std}); + $std1r = int($annot->{mean})+int($annot->{std}); + $std2r = int($annot->{mean})+2*int($annot->{std}); + unless($std1l == $std1r) { + if($std1l < 0) { + $std1l = 0; + } else { + $std1l = int($std1l/$bin); + } + if($std2l < 0) { + $std2l = 0; + } else { + $std2l = int($std2l/$bin); + } + if($std1r/$bin > 100) { + $std1r = 100; + } else { + $std1r = int($std1r/$bin); + } + if($std2r/$bin > 100) { + $std2r = 100; + } else { + $std2r = int($std2r/$bin); + } + $cr->rectangle($left+$offset+$std2l*$size+2, $top+$offset, ($std2r-$std2l)*$size, $height); + $cr->set_source_rgba(@std2col); + $cr->fill; + $cr->rectangle($left+$offset+$std1l*$size+2, $top+$offset, ($std1r-$std1l)*$size, $height); + $cr->set_source_rgba(@std1col); + $cr->fill; + #mean line + $cr->set_source_rgba(@meancol); + $cr->move_to($left+$offset+int(int($annot->{mean})/$bin)*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+int(int($annot->{mean})/$bin)*$size+2, $top+$offset+$height); + $cr->stroke; + #std lines + $cr->set_source_rgba(@stdcol); + if($std1l > 0) { + $cr->move_to($left+$offset+$std1l*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+$std1l*$size+2, $top+$offset+$height); + } + if($std2l > 0) { + $cr->move_to($left+$offset+$std2l*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+$std2l*$size+2, 
$top+$offset+$height); + } + if($std1r < 100) { + $cr->move_to($left+$offset+$std1r*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+$std1r*$size+2, $top+$offset+$height); + } + if($std2r < 100) { + $cr->move_to($left+$offset+$std2r*$size+2, $top+$offset-5); + $cr->line_to($left+$offset+$std2r*$size+2, $top+$offset+$height); + } + $cr->stroke; + #labels + $cr->set_antialias('default'); + $cr->set_source_rgba(@tickcol); + $extents = $cr->text_extents('M'); + $cr->move_to($left+$offset+int(int($annot->{mean})/$bin)*$size+2-$extents->{width}/2, $top+$offset-10); + $cr->show_text('M'); + if($std1l > 0) { + $extents = $cr->text_extents('1SD'); + $cr->move_to($left+$offset+$std1l*$size-$extents->{width}/2+2, $top+$offset-10); + $cr->show_text('1SD'); + } + if($std2l > 0) { + $extents = $cr->text_extents('2SD'); + $cr->move_to($left+$offset+$std2l*$size-$extents->{width}/2+3, $top+$offset-10); + $cr->show_text('2SD'); + } + if($std1r < 100) { + $extents = $cr->text_extents('1SD'); + $cr->move_to($left+$offset+$std1r*$size-$extents->{width}/2+2, $top+$offset-10); + $cr->show_text('1SD'); + } + if($std2r < 100) { + $extents = $cr->text_extents('2SD'); + $cr->move_to($left+$offset+$std2r*$size-$extents->{width}/2+3, $top+$offset-10); + $cr->show_text('2SD'); + } + } + } + + #draw boxes + $cr->set_antialias('none'); + $cr->set_source_rgba(@barcol); + foreach my $pos (0..$xmax-($zero ? 0 : 1)) { + next unless($matrix->[$pos]); + my $tmp = $matrix->[$pos] / $ymax; + #unique + if($tmp) { + $cr->rectangle($left+$offset+$pos*$size, $top+$offset+$height, $size-1, -$tmp*$height); + $cr->fill; + } + } + + #write image + $cr->show_page; + return $surface; +} + +sub convertToPCAValues { + my ($new,$type) = @_; + + my @data = ($type eq 'v' ? 
@$DINUCODDS_VIR : @$DINUCODDS_MIC); + + push(@data,$new); + + my $pca = Statistics::PCA->new; + + #suppress output from PCA module + my $output = ''; + open(TOOUTPUT, '>', \$output) or &printError("Can't open TOOUTPUT: $!"); + select TOOUTPUT; + + $pca->load_data({format => 'table', data => \@data}); + $pca->pca(); + + my @variances = $pca->results('proportion'); + my @list = $pca->results('transformed'); + + #end suppress output from PCA module + select STDOUT; + close(TOOUTPUT); + + my ($xmin,$xmax,$ymin,$ymax); + $xmax = $ymax = -100; + $xmin = $ymin = 100; + + #get min/max values for PC1 + foreach my $v (@{$list[0]}) { + $xmax = &max($xmax,$v); + $xmin = &min($xmin,$v); + } + #get min/max values for PC2 + foreach my $v (@{$list[1]}) { + $ymax = &max($ymax,$v); + $ymin = &min($ymin,$v); + } + + return ([$list[0],$list[1]],sprintf("%d",$variances[0]*100),sprintf("%d",$variances[1]*100),$xmin,$xmax,$ymin,$ymax,$type); +} + +sub createPCAPlot { + my ($data,$var1,$var2,$xmin,$xmax,$ymin,$ymax,$type,$title,$xlab,$ylab,$file) = @_; + + my @linecol = (0, 0, 0, 0.4); + my @helplinecol1 = (1,1,1, 0.9); + my @helplinecol2 = (1,1,1, 0.5); + + #create new image + my $size = 5; + my $offset = 20; + my $left = 25; + my $bottom = 15; + my $top = ($type eq 'v' ? 
35 : 20); + my $height = 500; + my $space = 10; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+$height+2*$space,$top+$bottom+$offset*2+$height+2*$space); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+$height+2*$space,$top+$bottom+$offset*2+$height+2*$space); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face('sans-serif', 'normal', 'normal'); + + $cr->save; + + #set up work space + my ($dx, $dy); + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, $height+2*$space, $height+2*$space); + $cr->set_source_rgba(0.95, 0.95, 0.95, 1); + $cr->fill; + + #get infos + my $num = scalar(@{$data->[0]})-1; + my $xrange = ($xmax-$xmin); + my $yrange = ($ymax-$ymin); + my $data_info = ($type eq 'v' ? $DATA_VIR : $DATA_MIC); + + #draw ticks + #x-axis + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->move_to($left+$offset+$space, $top+$offset+$height+2*$space); + $cr->line_to($left+$offset+$space, $top+$offset+$height+2*$space+3); + $cr->move_to($left+$offset+$space+$height, $top+$offset+$height+2*$space); + $cr->line_to($left+$offset+$space+$height, $top+$offset+$height+2*$space+3); + $cr->move_to($left+$offset+$space+int(abs($xmin)/$xrange*$height), $top+$offset+$height+2*$space); + $cr->line_to($left+$offset+$space+int(abs($xmin)/$xrange*$height), $top+$offset+$height+2*$space+3); + $cr->stroke; + #y-axis + $cr->move_to($left+$offset, $top+$offset+$space); + $cr->line_to($left+$offset-3, $top+$offset+$space); + $cr->move_to($left+$offset, $top+$offset+$height+$space); + $cr->line_to($left+$offset-3, $top+$offset+$height+$space); + $cr->move_to($left+$offset, $top+$offset+$space+int(abs($ymax)/$yrange*$height)); + $cr->line_to($left+$offset-3, $top+$offset+$space+int(abs($ymax)/$yrange*$height)); + $cr->stroke; + + #helplines 
+ $cr->set_source_rgba(@helplinecol1); + $cr->move_to($left+$offset+$space+int(abs($xmin)/$xrange*$height), $top+$offset); + $cr->line_to($left+$offset+$space+int(abs($xmin)/$xrange*$height), $top+$offset+$height+2*$space); + $cr->stroke; + $cr->move_to($left+$offset, $top+$offset+$space+int(abs($ymax)/$yrange*$height)); + $cr->line_to($left+$offset+2*$space+$height, $top+$offset+$space+int(abs($ymax)/$yrange*$height)); + $cr->stroke; + + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + $extents = $cr->text_extents(sprintf("%.2f",$xmin)); + $cr->move_to($left+$offset+$space-$extents->{width}/2-1, $top+$offset+$height+2*$space+$fontheight+2); + $cr->show_text(sprintf("%.2f",$xmin)); + $extents = $cr->text_extents(sprintf("%.2f",$xmax)); + $cr->move_to($left+$offset+$space+$height-$extents->{width}/2-1, $top+$offset+$height+2*$space+$fontheight+2); + $cr->show_text(sprintf("%.2f",$xmax)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset+$space+int(abs($xmin)/$xrange*$height)-$extents->{width}/2, $top+$offset+$height+2*$space+$fontheight+2); + $cr->show_text(0); + #y-axis + $extents = $cr->text_extents(sprintf("%.2f",$ymax)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$space+$fontheight/2-2); + $cr->show_text(sprintf("%.2f",$ymax)); + $extents = $cr->text_extents(sprintf("%.2f",$ymin)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$height+$space+$fontheight/2-2); + $cr->show_text(sprintf("%.2f",$ymin)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$space+int(abs($ymax)/$yrange*$height)+$fontheight/2-2); + $cr->show_text(0); + + $cr->save; + + #labels + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #add type + $cr->set_source_rgba(0, 0, 0, 0.5); + 
$extents = $cr->text_extents(uc($type)); + $cr->arc($offset/2+$extents->{width}/2, $offset-5, 10, 0, 2*$PI); + $cr->fill; + $cr->set_source_rgba(1, 1, 1, 1); + $cr->move_to($offset/2-($type eq 'm' ? 1 : 0), $offset); + $cr->show_text(uc($type)); + + #axis labels + $cr->set_source_rgba(0, 0, 0, 1); + $extents = $cr->text_extents($xlab.' ('.$var1.'%)'); + $cr->move_to($left+$offset+$height/2-$extents->{width}/2+$space, $top+$offset+$height+$fontheight+15+2*$space); + $cr->show_text($xlab.' ('.$var1.'%)'); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab.' ('.$var2.'%)'); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2)+$space,$offset); + $cr->show_text($ylab.' ('.$var2.'%)'); + + $cr->restore; + + #draw dots + $cr->set_antialias('default'); + $cr->set_font_size (10); + foreach my $i (0..$num) { + $cr->set_source_rgba(@{$data_info->[$i]->[3]}); + $cr->arc(($left+$offset+$space+int(($data->[0]->[$i]+abs($xmin))/$xrange*$height)), ($space+$top+$offset+int(($data->[1]->[$i]+abs($ymin))/$yrange*$height)), $size, 0, 2*$PI); + $cr->fill; + } + $cr->set_source_rgba(0, 0, 0, 1); + foreach my $i (0..$num) { + $extents = $cr->text_extents($data_info->[$i]->[1]); + $cr->move_to(($left+$offset+$space+int(($data->[0]->[$i]+abs($xmin))/$xrange*$height))+$size+1, ($space+$top+$offset+int(($data->[1]->[$i]+abs($ymin))/$yrange*$height))+$size*2); + $cr->show_text($data_info->[$i]->[1]); + } + + #draw legend + my %labels; + foreach my $i (0..$num) { + $labels{$data_info->[$i]->[1]} = $data_info->[$i]->[2]; + } + $cr->set_font_size(10); + $fontheight = $font_extents->{height}; + $cr->set_source_rgba(0, 0, 0, 1); + my $x = $left+$offset+$space; + my $y = int($offset/2); + foreach my $n (sort {$a <=> $b} keys %labels) { + if($x+$cr->text_extents($n.' - '.$labels{$n})->{width}+15 >= $left+$offset+$space+$height) { + $x = $left+$offset+$space; + $y += $fontheight; + } + $cr->move_to($x,$y); + $cr->show_text($n.' - '.$labels{$n}); + $x += $cr->text_extents($n.' 
- '.$labels{$n})->{width}+15; + + } + + #write image + $cr->show_page; + return $surface; +} + +sub createOddsRatioPlot { + my ($data,$title,$xlab,$ylab,$file) = @_; + + my @yvalues = (0.5,0.78,1.00,1.23,1.5); + + my @linecol = (0, 0, 0, 0.4); + my @helplinecol1 = (1,1,1, 0.9); + my @helplinecol2 = (1,1,1, 0.5); + + #create new image + my $size = 40; + my $offset = 20; + my $left = 35; + my $right = 90; + my $bottom = 20; + my $top = 0; + my $height = 100; + my $width = $size*10; + my $space = 20; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+$width+$right,$top+$bottom+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+$width+$right,$top+$bottom+$offset*2+$height); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); + + $cr->save; + + #set up work space + my ($dx, $dy); + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, $width, $height); + $cr->set_source_rgba(0.95, 0.95, 0.95, 1); + $cr->fill; + + #right side marks + $cr->set_source_rgba(255/255, 127/255, 127/255, 0.6); + $cr->rectangle($left+$offset+$width+8, $top+$offset, 3, 0.77/2*$height); + $cr->fill; + $cr->rectangle($left+$offset+$width+8, $top+$offset+$height-0.78/2*$height, 3, 0.78/2*$height); + $cr->fill; + + #get infos + my $num = scalar(keys %$data)-1; + + #draw ticks + #x-axis + $cr->set_source_rgba(0, 0, 0, 0.8); + foreach my $i (0..$num) { + $cr->move_to($left+$offset+$size/2+$i*$size, $top+$offset+$height); + $cr->line_to($left+$offset+$size/2+$i*$size, $top+$offset+$height+3); + } + $cr->stroke; + #y-axis + foreach my $i (@yvalues) { + $cr->move_to($left+$offset, $top+$offset+$height-$i/2*$height); + $cr->line_to($left+$offset-3, $top+$offset+$height-$i/2*$height); + } + $cr->stroke; + + 
#helplines + #x-axis + $cr->set_source_rgba(@helplinecol1); + foreach my $i (0..$num) { + $cr->move_to($left+$offset+$size/2+$i*$size, $top+$offset); + $cr->line_to($left+$offset+$size/2+$i*$size, $top+$offset+$height); + } + $cr->stroke; + #yaxis + foreach my $i (@yvalues) { + $cr->set_source_rgba(0, 0, 0, ($i == 0.5 || $i == 1.00 || $i == 1.50 ? 0.1 : 0.3)); + $cr->move_to($left+$offset, $top+$offset+$height-$i/2*$height); + $cr->line_to($left+$offset+$width, $top+$offset+$height-$i/2*$height); + $cr->stroke; + } + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + my $xcur = 0; + foreach my $dn (map {join("/",(m/../g ))} sort keys %$data) { + $extents = $cr->text_extents($dn); + $cr->move_to($left+$offset+$size/2-$extents->{width}/2-1+$size*$xcur++, $top+$offset+$height+$fontheight+2); + $cr->show_text($dn); + } + #y-axis + foreach my $i (@yvalues) { + $cr->set_source_rgba(0, 0, 0, ($i == 0.5 || $i == 1.00 || $i == 1.50 ? 
0.5 : 0.8)); + $extents = $cr->text_extents(sprintf("%.2f",$i)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$height-$i/2*$height+$fontheight/2-2); + $cr->show_text(sprintf("%.2f",$i)); + } + + #label on right side + $cr->set_source_rgba(0, 0, 0, 1); + $extents = $cr->text_extents('Over-represented'); + $cr->move_to($left+$offset+$width+15, $top+$offset+$height-1.6/2*$height+$fontheight/2-2); + $cr->show_text('Over-represented'); + $extents = $cr->text_extents('Under-represented'); + $cr->move_to($left+$offset+$width+15, $top+$offset+$height-0.4/2*$height+$fontheight/2-2); + $cr->show_text('Under-represented'); + + $cr->save; + + #labels + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #axis labels + #x-axis + $cr->set_source_rgba(0, 0, 0, 1); + $extents = $cr->text_extents($xlab); + $cr->move_to($left+$offset+$width/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab); + #y-axis + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2),$offset); + $cr->show_text($ylab); + + $cr->restore; + + #draw dots + $cr->set_antialias('default'); + $xcur = 0; + foreach my $dn (sort keys %$data) { + if($data->{$dn} > 1.23 || $data->{$dn} < 0.78) { + $cr->set_source_rgba(255/255, 127/255, 127/255, 1); + } else { + $cr->set_source_rgba(127/255, 127/255, 255/255, 1); + } + $cr->arc($left+$offset+$size/2+$size*$xcur++, $top+$offset+$height-$data->{$dn}/2*$height, 5, 0, 2*$PI); + $cr->fill; + } + + #write image + $cr->show_page; + return $surface; +} + +sub convertToBoxValues { + my ($data,$niceval) = @_; + my ($xmax,$ymax,@matrix); + $xmax = $ymax = 0; + foreach my $i (sort {$a <=> $b} keys %$data) { + $xmax++; + push(@matrix,[$i,$data->{$i}->{min},$data->{$i}->{p25},$data->{$i}->{median},$data->{$i}->{p75},$data->{$i}->{max}]); + $ymax = &max($ymax,$data->{$i}->{max}); + } + + if($niceval) { + $ymax = 
sprintf("%d",($ymax/$niceval)+1)*$niceval if($ymax % $niceval); + } + + return (\@matrix,$xmax,$ymax); +} + +sub createBoxPlot { + my ($matrix,$xmax,$ymax,$title,$xlab,$ylab,$file,$zero,$add,$bin) = @_; + $bin = ($bin ? $bin : 1); + $zero = 0 unless($zero); + $add = '' unless(defined $add); + if($xmax != 100) { + $xmax = 100; + } + $ymax = 1 unless($ymax); +# die Dumper $matrix; + + + my @col0 = (178/255, 178/255, 255/255); #b2b2ff + my @col1 = (255/255, 178/255, 178/255); #ffb2b2 + my @col3 = (127/255, 127/255, 255/255); #7f7fff + my @col4 = (255/255, 127/255, 127/255); #ff7f7f + my @linecol = (0, 0, 0, 0.4); + my @linecol0 = (@col3, 1); + my @linecol1 = (@col4, 1); + my @boxcol = (@col3, 1); + my @whiscol = (@col0, 0.9); + my @medcol = (0,0,0, 0.5); + my @helplinecol1 = (1,1,1, 0.9); + my @helplinecol2 = (1,1,1, 0.5); + + #create new image + my $size = 6; + my $offset = 20; + my $left = 25; + my $bottom = 25; + my $top = 5; + my $height = 300; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+2*200+20); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); +# $cr->set_font_size (30); + + $cr->save; + + #set up work space + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, ($xmax+$zero)*$size-1, $height); + $cr->set_source_rgba(0.95, 0.95, 0.95, 1); + $cr->fill; + + #draw legend + $cr->set_font_size(10); +# $font_extents = $cr->font_extents; + my $x = $left+$offset+$size*50; + foreach my $v ([\@whiscol,'Min/Max value'],[\@boxcol,'25th to 75th percentile'],[\@medcol,'Median']) { + $cr->set_antialias('none'); + $cr->set_source_rgba(@{$v->[0]}); + 
$cr->rectangle($x, $top+5, 10, 10); + $cr->fill; + $x += 15; + $cr->set_antialias('default'); + $cr->move_to($x,$top+5+9); + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->show_text($v->[1]); + $x += $cr->text_extents($v->[1])->{width}+15; + } + + $cr->set_antialias('none'); + + #draw ticks + #x-axis + $cr->set_source_rgba(0, 0, 0, 0.8); +# $cr->move_to($left+$offset+int($size/2+1), $top+$offset+$height); +# $cr->line_to($left+$offset+int($size/2+1), $top+$offset+$height+3); +# $cr->move_to($left+$offset+int($size/2+1), $top+$offset+$height+$space); +# $cr->line_to($left+$offset+int($size/2+1), $top+$offset+$height+$space-3); + foreach my $i (1..9) { + $cr->move_to($left+$offset+int($size/2)+$size*10*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*10*$i-($zero ? 0 : $size)-1, $top+$offset+$height+3); +# $cr->move_to($left+$offset+int($size/2)+$size*10*$i-($zero ? 0 : $size)-1, $top+$offset); +# $cr->line_to($left+$offset+int($size/2)+$size*10*$i-($zero ? 0 : $size)-1, $top+$offset-3); + } + $cr->stroke; + #y-axis + foreach my $j (0..4) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset-3, $top+$offset+$height*$j/4-($j ? 1 : 0)); +# $cr->move_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 1 : 0)); +# $cr->line_to($left+$offset+($xmax+$zero)*$size+3, $top+$offset+$height*$j/4-($j ? 1 : 0)); + } + $cr->stroke; + + #helplines + $cr->set_source_rgba(@helplinecol1); + foreach my $j (1..3) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 
1 : 0)); + } + $cr->stroke; + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis +# $extents = $cr->text_extents(1); +# $cr->move_to($left+$offset+int($size/2+1)-$extents->{width}, $top+$offset+$height+$fontheight+2); +# $cr->show_text(1); + foreach my $i (1..9) { + $extents = $cr->text_extents($i*10*$bin); + $cr->move_to($left+$offset+int($size/2+1)+$size*10*$i-($zero ? 0 : $size)-$extents->{width}/2-1-($i == 1 ? 1 : 0), $top+$offset+$height+$fontheight+2); + $cr->show_text($i*10*$bin); + } + #y-axis + foreach my $j (0..4) { + $extents = $cr->text_extents(&addCommas($ymax*$j/4)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2+$height*(4-$j)/4); + $cr->show_text(&addCommas($ymax*$j/4)); + } + + $cr->save; + + #axis labels + $cr->set_source_rgba(0, 0, 0, 1); + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + $extents = $cr->text_extents($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? $add : '').')' : '')); + $cr->move_to($left+$offset+($xmax+$zero)*$size/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? 
$add : '').')' : '')); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2),$offset); + $cr->show_text($ylab); + + $cr->restore; + + #draw boxes + my $factor = $height/$ymax; + $cr->set_antialias('none'); + foreach my $v (@$matrix) { + #whiskers + $cr->set_source_rgba(@whiscol); + if($v->[1] != $v->[2]) { + $cr->move_to($left+$offset+$size*$v->[0]+1, $top+$offset+$height-$v->[1]*$factor-1); + $cr->line_to($left+$offset+$size*$v->[0]+$size-2, $top+$offset+$height-$v->[1]*$factor-1); + $cr->stroke; + } + if($v->[4] != $v->[5]) { + $cr->move_to($left+$offset+$size*$v->[0]+1, $top+$offset+$height-$v->[5]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+$size-2, $top+$offset+$height-$v->[5]*$factor); + $cr->stroke; + } + $cr->save; + $cr->set_dash(1,4,3); + if($v->[1] != $v->[2]) { + $cr->move_to($left+$offset+$size*$v->[0]+int($size/2)-1, $top+$offset+$height-$v->[2]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+int($size/2)-1, $top+$offset+$height-$v->[1]*$factor); + $cr->stroke; + } + if($v->[4] != $v->[5]) { + $cr->move_to($left+$offset+$size*$v->[0]+int($size/2)-1, $top+$offset+$height-$v->[5]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+int($size/2)-1, $top+$offset+$height-$v->[4]*$factor-1); + $cr->stroke; + } + $cr->restore; + #box + if(($v->[2] != $v->[3]) || ($v->[4] != $v->[3])) { + $cr->set_source_rgba(@whiscol); + $cr->rectangle($left+$offset+$size*$v->[0], $top+$offset+$height-$v->[2]*$factor, $size-1, -($v->[4]-$v->[2])*$factor); + $cr->fill; + $cr->stroke; + $cr->set_source_rgba(@boxcol); + $cr->rectangle($left+$offset+$size*$v->[0], $top+$offset+$height-$v->[2]*$factor, $size-2, -($v->[4]-$v->[2])*$factor); + $cr->stroke; + } else { + $cr->set_source_rgba(@boxcol); + $cr->move_to($left+$offset+$size*$v->[0], $top+$offset+$height-$v->[3]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+$size-1, $top+$offset+$height-$v->[3]*$factor); + $cr->stroke; + } + 
#median + $cr->set_source_rgba(@medcol); + $cr->move_to($left+$offset+$size*$v->[0]+1, $top+$offset+$height-$v->[3]*$factor); + $cr->line_to($left+$offset+$size*$v->[0]+$size-2, $top+$offset+$height-$v->[3]*$factor); + $cr->stroke; + } + + #write image + $cr->show_page; + return $surface; +} + +sub convertToBarValues { + my ($data,$niceval,$start,$max) = @_; + my ($xmax,$ymax,@matrix,$tmp); + $xmax = $ymax = 0; + + #get xmax value + if($max) { + $xmax = $max; + } else { + foreach my $q (keys %$data) { + $xmax = &max($xmax,$q); + } + } + if($niceval) { + $xmax = sprintf("%d",($xmax/$niceval)+1)*$niceval if($xmax % $niceval); + } + + #get matrix values + foreach my $q ($start..$xmax) { + $tmp = (exists $data->{$q} ? $data->{$q} : 0); + $ymax = &max($ymax,$tmp); + push(@matrix,$tmp); + } + + $ymax = sprintf("%d",($ymax/4)+1)*4 if($ymax % 4); + + return (\@matrix,$xmax,$ymax); +} + +sub createBarPlot { + my ($matrix,$xmax,$ymax,$title,$xlab,$ylab,$file,$zero) = @_; + + my @col0 = (178/255, 178/255, 255/255); #b2b2ff + my @col1 = (255/255, 178/255, 178/255); #ffb2b2 + my @col3 = (127/255, 127/255, 255/255); #7f7fff + my @col4 = (255/255, 127/255, 127/255); #ff7f7f + my @linecol = (0, 0, 0, 0.4); + my @linecol0 = (@col3, 1); + my @linecol1 = (@col4, 1); + my @barcol0 = (@col3, 1); + my @barcol1 = (@col4, 1); + my @helplinecol1 = (1,1,1, 0.9); + my @helplinecol2 = (1,1,1, 0.5); + + #create new image + my $size = ($xmax <= 50 ? 10 : ($xmax <= 100 ? 
6 : 3)); + my $offset = 20; + my $left = 25; + my $bottom = 15; + my $top = 0; + my $height = 200; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+2*200+20); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); +# $cr->set_font_size (30); + + $cr->save; + + #set up work space + my ($dx, $dy); + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, ($xmax+$zero)*$size-1, $height); + $cr->set_source_rgba(0.95, 0.95, 0.95, 1); + $cr->fill; + + #draw ticks + #x-axis + $cr->set_source_rgba(0, 0, 0, 0.8); + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%5) == 0 && $i > 1 && $i < $xmax) { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+3); + } else { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+1); + } + } + $cr->stroke; + + #y-axis + $cr->move_to($left+$offset, $top+$offset); + $cr->line_to($left+$offset-3, $top+$offset); + $cr->move_to($left+$offset, $top+$offset+$height-1); + $cr->line_to($left+$offset-3, $top+$offset+$height-1); + $cr->stroke; + + #helplines + $cr->set_source_rgba(@helplinecol1); + foreach my $j (1..3) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 
1 : 0)); + } + $cr->stroke; + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(0, 0, 0, 0.8); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%5) == 0 && $i > 1 && $i < $xmax) { + $extents = $cr->text_extents($i); + $cr->move_to($left+$offset+int($size/2+1)+$size*$i-($zero ? 0 : $size)-$extents->{width}/2-1-($i == 1 ? 1 : 0), $top+$offset+$height+$fontheight+2); + $cr->show_text($i); + } + } + #y-axis + $extents = $cr->text_extents(&addCommas($ymax)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2); + $cr->show_text(&addCommas($ymax)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2+$height); + $cr->show_text(0); + + $cr->save; + + #labels + $cr->set_font_size (14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #axis labels + $cr->set_source_rgba(0, 0, 0, 1); + $extents = $cr->text_extents($xlab); + $cr->move_to($left+$offset+($xmax+$zero)*$size/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2),$offset); + $cr->show_text($ylab); + + $cr->restore; + + #draw boxes + $cr->set_antialias('none'); + foreach my $pos (0..$xmax-($zero ? 0 : 1)) { + next unless($matrix->[$pos+($zero ? 0 : 1)]); + my $tmp = $matrix->[$pos+($zero ? 0 : 1)] / $ymax; + #unique + if($tmp) { + $cr->set_source_rgba(@barcol0); + $cr->rectangle($left+$offset+($pos+($zero ? 
0 : 1))*$size, $top+$offset+$height, $size-1, -$tmp*$height); + $cr->fill; + } + } + + #write image + $cr->show_page; + return $surface; +} + +sub convertOdToStackBinMatrix { + my ($data,$stacks,$min,$max,$nonice) = @_; + + my ($num,$ymax,$xmax,$xmin,$step,%vals,%sums,$sum,@matrix,$bin,$tmpbin); + + #make nice xmax value + if(defined $max) { + $xmax = $max; + } else { + $xmax = (sort {$b <=> $a} keys %$data)[0]; + } + $bin = &getBinVal($xmax); + $xmax = $bin*100; + $xmin = (defined $min ? $min : 0); + + #get data to bin and find y axis max value + $ymax = 0; + foreach my $s (0..$stacks-1) { + $sums{$s} = 0; + } + $sum = 0; + $tmpbin = $bin; + foreach my $i ($xmin..$xmax) { + foreach my $s (0..$stacks-1) { + next unless(exists $data->{$i}->{$s}); + $sums{$s} += $data->{$i}->{$s}; + $sum += $data->{$i}->{$s}; + } + if(--$tmpbin <= 0) { + $tmpbin = $bin; + $ymax = &max($ymax,$sum); + $sum = 0; + foreach my $s (0..$stacks-1) { + push(@{$matrix[$s]},$sums{$s}); + $sums{$s} = 0; + } + } + } + + #make nice ymax value + unless($nonice) { + $ymax = sprintf("%d",($ymax/4)+1)*4 if($ymax % 4); +# $step = ($ymax <= 10 ? 10 : ($ymax < 40 ? 40 : ($ymax < 100 ? 100 : ($ymax < 1000 ? 100 : 100)))); +# $ymax = sprintf("%d",($ymax/$step)+1)*$step if($ymax % $step); + } + + return (\@matrix,$xmax,$ymax,$stacks); +} + +sub createStackBarPlot { + my ($matrix,$xmax,$ymax,$stacks,$title,$xlab,$ylab,$file,$zero,$add) = @_; + + my $bin = 1; + if($xmax > 100) { + $bin = $xmax / 100; + $xmax = 100; + } + + my @legend = ('Exact dupl.','5\' dupl.','3\' dupl.','Rev. compl. exact dupl.','Rev. compl. 
5\'/3\' dupl.'); + my @cols = ([69/255, 114/255, 167/255, 1], + [137/255, 165/255, 78/255, 1], + [170/255, 70/255, 67/255, 1], + [147/255, 169/255, 207/255, 1], + [51/255, 102/255, 102/255, 1]); + my @barcol = (127/255, 127/255, 255/255, 1); #7f7fff + my @meancol = (255/255, 127/255, 127/255, 1); #ff7f7f + my @stdcol = (178/255, 178/255, 255/255, 0.8); #b2b2ff + my @std1col = (0, 0, 0, 0.02); #black, 2% opacity + my @std2col = (0, 0, 0, 0.02); #black, 2% opacity + my @linecol = (0, 0, 0, 0.4); + my @helplinecol = (1, 1, 1, 0.9); + my @background = (0.95, 0.95, 0.95, 1); + my @tickcol = (0, 0, 0, 0.8); + my @labelcol = (0, 0, 0, 1); + + #create new image + my $size = 6; + my $offset = 20; + my $left = 40; + my $bottom = 15; + my $top = 20; + my $height = 200; + my $surface = Cairo::ImageSurface->create('argb32', $left+$offset*2+($xmax+$zero)*$size,$bottom+$top+$offset*2+$height); #format, width, height + my $cr = Cairo::Context->create($surface); + + my ($font_extents,$extents,$fontheight,$fontdescent); + + #background + $cr->rectangle(0, 0, $left+$offset*2+($xmax+$zero)*$size,$bottom+$offset*2+2*200+20); + $cr->set_source_rgba(1, 1, 1, 1); + $cr->fill; + + #fonts + $cr->select_font_face ('sans', 'normal', 'normal'); + + $cr->save; + + #set up work space + $cr->set_antialias('none'); + $cr->set_line_width(1); + + #background for plot + $cr->rectangle($left+$offset, $top+$offset, ($xmax+$zero)*$size-1, $height); + $cr->set_source_rgba(@background); + $cr->fill; + + #draw legend + $cr->set_font_size(10); +# $font_extents = $cr->font_extents; + my $x = $left+$offset+$size*100-5; + foreach my $i (reverse (0..scalar(@legend)-1)) { + $cr->set_antialias('default'); + $x -= $cr->text_extents($legend[$i])->{width}; + $cr->move_to($x,$top+5+9); + $cr->set_source_rgba(@tickcol); + $cr->show_text($legend[$i]); + $x -= 15; + $cr->set_antialias('none'); + $cr->set_source_rgba(@{$cols[$i]}); + $cr->rectangle($x, $top+5, 10, 10); + $cr->fill; + $x -= 15; + } + + #draw ticks + 
$cr->set_antialias('none'); + #x-axis + $cr->set_source_rgba(@tickcol); + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%5) == 0 && $i > 1 && $i < $xmax) { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+3); + } else { + $cr->move_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height); + $cr->line_to($left+$offset+int($size/2)+$size*$i-($zero ? 0 : $size)-1, $top+$offset+$height+1); + } + } + $cr->stroke; + + #y-axis + $cr->move_to($left+$offset, $top+$offset); + $cr->line_to($left+$offset-3, $top+$offset); + $cr->move_to($left+$offset, $top+$offset+$height-1); + $cr->line_to($left+$offset-3, $top+$offset+$height-1); + $cr->stroke; + + #helplines + $cr->set_source_rgba(@helplinecol); + foreach my $j (1..3) { + $cr->move_to($left+$offset, $top+$offset+$height*$j/4-($j ? 1 : 0)); + $cr->line_to($left+$offset+($xmax+$zero)*$size, $top+$offset+$height*$j/4-($j ? 1 : 0)); + } + $cr->stroke; + + $cr->set_antialias('default'); + + #tick labels + $cr->set_source_rgba(@tickcol); + $cr->set_font_size(10); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + #x-axis + foreach my $i (($zero ? 0 : 1)..$xmax) { + if(($i%10) == 0 && $i > 1 && $i < $xmax) { + $extents = $cr->text_extents($i*$bin); + $cr->move_to($left+$offset+int($size/2+1)+$size*$i-($zero ? 0 : $size)-$extents->{width}/2-1-($i == 1 ? 
1 : 0), $top+$offset+$height+$fontheight+2); + $cr->show_text($i*$bin); + } + } + #y-axis + $extents = $cr->text_extents(&addCommas($ymax)); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2); + $cr->show_text(&addCommas($ymax)); + $extents = $cr->text_extents(0); + $cr->move_to($left+$offset-5-$extents->{width}, $top+$offset+$fontheight/2-2+$height); + $cr->show_text(0); + + $cr->save; + + #labels + $cr->set_font_size(14); + $font_extents = $cr->font_extents; + $fontheight = $font_extents->{height}; + + #axis labels + $cr->set_source_rgba(@labelcol); + $extents = $cr->text_extents($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? $add : '').')' : '')); + $cr->move_to($left+$offset+($xmax+$zero)*$size/2-$extents->{width}/2, $top+$offset+$height+$fontheight+15); + $cr->show_text($xlab.($bin>1 ? ' (Bin size: '.$bin.($add ? $add : '').')' : '')); + $cr->rotate($PI * 3 / 2); + $extents = $cr->text_extents($ylab.($bin>1 ? ' (per bin)' : '')); + $cr->move_to(-($top+$offset+$height/2+$extents->{width}/2+($bin>1 ? 12 : 0)),$offset+10); + $cr->show_text($ylab.($bin>1 ? ' (per bin)' : '')); + + $cr->restore; + + #draw boxes + $cr->set_antialias('none'); + foreach my $pos (0..$xmax-($zero ? 0 : 1)) { + my $tmp = 0; + foreach my $s (0..$stacks-1) { + next unless($matrix->[$s]->[$pos]); + my $cur = $matrix->[$s]->[$pos] / $ymax; + $cr->set_source_rgba(@{$cols[$s]}); + if($cur) { + $cr->rectangle($left+$offset+$pos*$size, $top+$offset+$height-$tmp*$height, $size-1, -$cur*$height); + $cr->fill; + } + $tmp += $cur; + } + } + + #write image + $cr->show_page; + return $surface; +} + +sub header { + return ' + + + +PRINSEQ-'.$WHAT.' Report + + + +
'; +} + +sub footer { + return '
'; +} + +sub generateHtml { + my ($in,$out) = @_; + my ($file,$data,$surface,$html,$png); + $data = &readGdFile($in); + my $time = sprintf("%02d/%02d/%04d %02d:%02d:%02d",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); + + $html .= &header(); + $html .= '

PRINSEQ-'.$WHAT.' v'.$VERSION.' HTML Report   

[Generated: '.$time.']

'; + $html .= '
'; + + #input info + if(exists $data->{numseqs}) { + $html .= '
Input Information
'; + $html .= '
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + my $singletons1 = ($data->{numseqs}||0)-($data->{pairs}||0); + my $singletons2 = ($data->{numseqs2}||0)-($data->{pairs}||0); + $html .= ''; + } else { + $html .= ''; + } + $html .= '
Input file(s):'.($data->{filename1} ? &convertIntToString($data->{filename1}) : '-').($data->{filename2} ? ' and '.&convertIntToString($data->{filename2}) : '').'
Input format(s):'.($data->{format1} ? uc($data->{format1}) : '-').($data->{format2} ? ' and '.uc($data->{format2}) : '').'
# Sequences (file 1):'.&addCommas($data->{numseqs}||'-').'
Total bases (file 1):'.&addCommas($data->{numbases}||'-').'
# Sequences (file 2):'.&addCommas($data->{numseqs2}||'-').'
Total bases (file 2):'.&addCommas($data->{numbases2}||'-').'
# Pairs:'.&addCommas($data->{pairs}||'-').($data->{pairs} ? '  ('.sprintf("%.2f",(100*(2*$data->{pairs})/(($data->{numseqs}||0)+($data->{numseqs2}||0)))).'% of sequences)' : '').'
# Singletons (file 1):'.&addCommas($singletons1).($singletons1 ? '  ('.sprintf("%.2f",(100*$singletons1/$data->{numseqs})).'%)' : '').'
# Singletons (file 2):'.&addCommas($singletons2).($singletons2 ? '  ('.sprintf("%.2f",(100*$singletons2/$data->{numseqs2})).'%)' : '').'
# Sequences:'.&addCommas($data->{numseqs}||'-').'
Total bases:'.&addCommas($data->{numbases}||'-').'

'; + } + + #length plot + if(exists $data->{counts}->{length} && keys %{$data->{counts}->{length}}) { + $html .= '
Length Distribution
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= '
File 1
Mean sequence length: '.(exists $data->{stats}->{length}->{mean} ? sprintf("%.2f",$data->{stats}->{length}->{mean}) : '-').' ± '.(exists $data->{stats}->{length}->{std} ? sprintf("%.2f",$data->{stats}->{length}->{std}) : '-').' bp
Minimum length: '.(exists $data->{stats}->{length}->{min} ? &addCommas($data->{stats}->{length}->{min}) : '-').' bp
Maximum length:'.(exists $data->{stats}->{length}->{max} ? &addCommas($data->{stats}->{length}->{max}) : '-').' bp
Length range:'.(exists $data->{stats}->{length}->{range} ? &addCommas($data->{stats}->{length}->{range}) : '-').' bp
Mode length: '.(exists $data->{stats}->{length}->{mode} ? &addCommas($data->{stats}->{length}->{mode}) : '-').' bp with '.(exists $data->{stats}->{length}->{modeval} ? &addCommas($data->{stats}->{length}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{length},1),$data->{stats}->{length},'Length Distribution','Read Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + $html .= '

File 2
Mean sequence length: '.(exists $data->{stats2}->{length}->{mean} ? sprintf("%.2f",$data->{stats2}->{length}->{mean}) : '-').' ± '.(exists $data->{stats2}->{length}->{std} ? sprintf("%.2f",$data->{stats2}->{length}->{std}) : '-').' bp
Minimum length: '.(exists $data->{stats2}->{length}->{min} ? &addCommas($data->{stats2}->{length}->{min}) : '-').' bp
Maximum length:'.(exists $data->{stats2}->{length}->{max} ? &addCommas($data->{stats2}->{length}->{max}) : '-').' bp
Length range:'.(exists $data->{stats2}->{length}->{range} ? &addCommas($data->{stats2}->{length}->{range}) : '-').' bp
Mode length: '.(exists $data->{stats2}->{length}->{mode} ? &addCommas($data->{stats2}->{length}->{mode}) : '-').' bp with '.(exists $data->{stats2}->{length}->{modeval} ? &addCommas($data->{stats2}->{length}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{length},1),$data->{stats2}->{length},'Length Distribution','Read Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + } else { + $html .= '
Mean sequence length: '.(exists $data->{stats}->{length}->{mean} ? sprintf("%.2f",$data->{stats}->{length}->{mean}) : '-').' ± '.(exists $data->{stats}->{length}->{std} ? sprintf("%.2f",$data->{stats}->{length}->{std}) : '-').' bp
Minimum length: '.(exists $data->{stats}->{length}->{min} ? &addCommas($data->{stats}->{length}->{min}) : '-').' bp
Maximum length:'.(exists $data->{stats}->{length}->{max} ? &addCommas($data->{stats}->{length}->{max}) : '-').' bp
Length range:'.(exists $data->{stats}->{length}->{range} ? &addCommas($data->{stats}->{length}->{range}) : '-').' bp
Mode length: '.(exists $data->{stats}->{length}->{mode} ? &addCommas($data->{stats}->{length}->{mode}) : '-').' bp with '.(exists $data->{stats}->{length}->{modeval} ? &addCommas($data->{stats}->{length}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{length},1),$data->{stats}->{length},'Length Distribution','Read Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + } + $html .= '
'; + } + + #GC content + if(exists $data->{counts}->{gc} && keys %{$data->{counts}->{gc}}) { + $html .= '
GC Content Distribution
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= '
File 1
Mean GC content: '.(exists $data->{stats}->{gc}->{mean} ? sprintf("%.2f",$data->{stats}->{gc}->{mean}) : '-').' ± '.(exists $data->{stats}->{gc}->{std} ? sprintf("%.2f",$data->{stats}->{gc}->{std}) : '-').' %
Minimum GC content: '.(exists $data->{stats}->{gc}->{min} ? $data->{stats}->{gc}->{min} : '-').' %
Maximum GC content: '.(exists $data->{stats}->{gc}->{max} ? $data->{stats}->{gc}->{max} : '-').' %
GC content range: '.(exists $data->{stats}->{gc}->{range} ? $data->{stats}->{gc}->{range} : '-').' %
Mode GC content: '.(exists $data->{stats}->{gc}->{mode} ? $data->{stats}->{gc}->{mode} : '-').' % with '.(exists $data->{stats}->{gc}->{modeval} ? &addCommas($data->{stats}->{gc}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{gc},0),$data->{stats}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + $html .= '

File 2
Mean GC content: '.(exists $data->{stats2}->{gc}->{mean} ? sprintf("%.2f",$data->{stats2}->{gc}->{mean}) : '-').' ± '.(exists $data->{stats2}->{gc}->{std} ? sprintf("%.2f",$data->{stats2}->{gc}->{std}) : '-').' %
Minimum GC content: '.(exists $data->{stats2}->{gc}->{min} ? $data->{stats2}->{gc}->{min} : '-').' %
Maximum GC content: '.(exists $data->{stats2}->{gc}->{max} ? $data->{stats2}->{gc}->{max} : '-').' %
GC content range: '.(exists $data->{stats2}->{gc}->{range} ? $data->{stats2}->{gc}->{range} : '-').' %
Mode GC content: '.(exists $data->{stats2}->{gc}->{mode} ? $data->{stats2}->{gc}->{mode} : '-').' % with '.(exists $data->{stats2}->{gc}->{modeval} ? &addCommas($data->{stats2}->{gc}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{gc},0),$data->{stats2}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + } else { + $html .= '
Mean GC content: '.(exists $data->{stats}->{gc}->{mean} ? sprintf("%.2f",$data->{stats}->{gc}->{mean}) : '-').' ± '.(exists $data->{stats}->{gc}->{std} ? sprintf("%.2f",$data->{stats}->{gc}->{std}) : '-').' %
Minimum GC content: '.(exists $data->{stats}->{gc}->{min} ? $data->{stats}->{gc}->{min} : '-').' %
Maximum GC content: '.(exists $data->{stats}->{gc}->{max} ? $data->{stats}->{gc}->{max} : '-').' %
GC content range: '.(exists $data->{stats}->{gc}->{range} ? $data->{stats}->{gc}->{range} : '-').' %
Mode GC content: '.(exists $data->{stats}->{gc}->{mode} ? $data->{stats}->{gc}->{mode} : '-').' % with '.(exists $data->{stats}->{gc}->{modeval} ? &addCommas($data->{stats}->{gc}->{modeval}) : '-').' sequences

'; + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{gc},0),$data->{stats}->{gc},'GC Content Distribution','GC Content (0-100%)','Number of Sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $html .= '
'; + } + $html .= '
'; + } + + #Base quality + if(exists $data->{quals} || exists $data->{qualsmean} || exists $data->{qualsbin}) { + $html .= '
Base Quality Distribution
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= 'File 1
'; + } + } + if(exists $data->{quals} && keys %{$data->{quals}}) { + $surface = &createBoxPlot(&convertToBoxValues($data->{quals},4),'Base Quality Distribution','Read position in %','Quality score',''); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + } + if(exists $data->{qualsbin} && keys %{$data->{qualsbin}}) { + $surface = &createBoxPlot(&convertToBoxValues($data->{qualsbin},4),'Base Quality Distribution','Read position in bp','Quality score','',0,'bp',$data->{binval}); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{quals}); + $html .= &insert_image($png); + } + if(exists $data->{qualsmean} && keys %{$data->{qualsmean}}) { + $surface = &createBarPlot(&convertToBarValues($data->{qualsmean},5,1),'Sequence Quality Distribution','Mean of quality scores per sequence','Number of sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{qualsbin}); + $html .= &insert_image($png); + } + if(exists $data->{pairedend} && $data->{pairedend}) { + if(exists $data->{quals} || exists $data->{qualsmean} || exists $data->{qualsbin}) { + $html .= '


File 2
'; + } + if(exists $data->{quals2} && keys %{$data->{quals2}}) { + $surface = &createBoxPlot(&convertToBoxValues($data->{quals2},4),'Base Quality Distribution','Read position in %','Quality score',''); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + } + if(exists $data->{qualsbin2} && keys %{$data->{qualsbin2}}) { + $surface = &createBoxPlot(&convertToBoxValues($data->{qualsbin2},4),'Base Quality Distribution','Read position in bp','Quality score','',0,'bp',$data->{binval}); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{quals2}); + $html .= &insert_image($png); + } + if(exists $data->{qualsmean2} && keys %{$data->{qualsmean2}}) { + $surface = &createBarPlot(&convertToBarValues($data->{qualsmean2},5,1),'Sequence Quality Distribution','Mean of quality scores per sequence','Number of sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{qualsbin2}); + $html .= &insert_image($png); + } + } + if(exists $data->{quals} || exists $data->{qualsmean} || exists $data->{qualsbin}) { + $html .= '

'; + } + + #Ns + if((exists $data->{counts}->{ns} && keys %{$data->{counts}->{ns}}) || (exists $data->{counts2} && exists $data->{counts2}->{ns} && keys %{$data->{counts2}->{ns}})) { + $html .= '
Occurrence of N
'; + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= 'File 1
'; + } + } + if(exists $data->{counts}->{ns} && keys %{$data->{counts}->{ns}}) { + my $nscount = 0; + foreach my $n (values %{$data->{counts}->{ns}}) { + $nscount += $n; + } + $html .= '
Sequences with N: '.($nscount ? &addCommas($nscount).'  ('.sprintf("%.2f",100/$data->{numseqs}*$nscount).' %)' : 0).'
Max percentage of Ns per sequence: '.(exists $data->{stats}->{ns}->{max} ? $data->{stats}->{ns}->{max} : 0).' %
'; + if($nscount) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{ns},1),undef,'Percentage of N\'s (> 0%)','Percentage of N\'s per Read (1-100%)','# Sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + } + if(exists $data->{pairedend} && $data->{pairedend} && exists $data->{counts2}->{ns} && keys %{$data->{counts2}->{ns}}) { + $html .= '


File 2
'; + my $nscount = 0; + foreach my $n (values %{$data->{counts2}->{ns}}) { + $nscount += $n; + } + $html .= '
Sequences with N: '.($nscount ? &addCommas($nscount).'  ('.sprintf("%.2f",100/$data->{numseqs2}*$nscount).' %)' : 0).'
Max percentage of Ns per sequence: '.(exists $data->{stats2}->{ns}->{max} ? $data->{stats2}->{ns}->{max} : 0).' %
'; + if($nscount) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{ns},1),undef,'Percentage of N\'s (> 0%)','Percentage of N\'s per Read (1-100%)','# Sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + } + if((exists $data->{counts}->{ns} && keys %{$data->{counts}->{ns}}) || (exists $data->{counts2} && exists $data->{counts2}->{ns} && keys %{$data->{counts2}->{ns}})) { + $html .= '

'; + } + + #tails + if(exists $data->{tail} || exists $data->{tail2}) { + $html .= '
Poly-A/T Tails
'; + } + if(exists $data->{tail}) { + my $tail5count = 0; + foreach my $n (values %{$data->{counts}->{tail5}}) { + $tail5count += $n; + } + my $tail3count = 0; + foreach my $n (values %{$data->{counts}->{tail3}}) { + $tail3count += $n; + } + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= 'File 1
'; + } + $html .= '
5\'-end 3\'-end
Sequences with tail:'.($tail5count ? &addCommas($tail5count).'  ('.sprintf("%.2f",100/$data->{numseqs}*$tail5count).' %)' : 0).' '.($tail3count ? &addCommas($tail3count).'  ('.sprintf("%.2f",100/$data->{numseqs}*$tail3count).' %)' : 0).'
Maximum tail length: '.(exists $data->{stats}->{tail5}->{max} ? $data->{stats}->{tail5}->{max} : 0).' '.(exists $data->{stats}->{tail3}->{max} ? $data->{stats}->{tail3}->{max} : 0).'
'; + if($tail5count) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{tail5},1),undef,'Poly-A/T Tail Distribution (> 4bp)','5\' Tail Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + if($tail3count) { + $html .= '
'; + } + } + if($tail3count) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts}->{tail3},1),undef,'Poly-A/T Tail Distribution (> 4bp)','3\' Tail Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + } + if(exists $data->{pairedend} && $data->{pairedend} && exists $data->{tail2}) { + my $tail5count = 0; + foreach my $n (values %{$data->{counts2}->{tail5}}) { + $tail5count += $n; + } + my $tail3count = 0; + foreach my $n (values %{$data->{counts2}->{tail3}}) { + $tail3count += $n; + } + $html .= '


File 2
'; + $html .= '
5\'-end 3\'-end
Sequences with tail:'.($tail5count ? &addCommas($tail5count).'  ('.sprintf("%.2f",100/$data->{numseqs2}*$tail5count).' %)' : 0).' '.($tail3count ? &addCommas($tail3count).'  ('.sprintf("%.2f",100/$data->{numseqs2}*$tail3count).' %)' : 0).'
Maximum tail length: '.(exists $data->{stats2}->{tail5}->{max} ? $data->{stats2}->{tail5}->{max} : 0).' '.(exists $data->{stats2}->{tail3}->{max} ? $data->{stats2}->{tail3}->{max} : 0).'
'; + if($tail5count) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{tail5},1),undef,'Poly-A/T Tail Distribution (> 4bp)','5\' Tail Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + if($tail3count) { + $html .= '
'; + } + } + if($tail3count) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{counts2}->{tail3},1),undef,'Poly-A/T Tail Distribution (> 4bp)','3\' Tail Length in bp','# Sequences','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + } + if(exists $data->{tail} || exists $data->{tail2}) { + $html .= '

'; + } + + + #tag sequence check + if(exists $data->{freqs} || exists $data->{freqs2}) { + $html .= '
Tag Sequence Check
'; + } + if(exists $data->{freqs}) { + my $tagmidseq; + if(exists $data->{tagmidseq}) { + $tagmidseq = $data->{tagmidseq}; + $tagmidseq =~ s/\,/\
/g; + } + if(exists $data->{pairedend} && $data->{pairedend}) { + $html .= 'File 1
'; + } + $html .= '
5\'-end3\'-end
Probability of tag sequence:'.(exists $data->{tagprob}->{5} ? $data->{tagprob}->{5}.' %' : '-').''.(exists $data->{tagprob}->{3} ? $data->{tagprob}->{3}.' %' : '-').'
GSMIDs or RLMIDs:'.(exists $data->{tagmidnum} ? ($data->{tagmidnum} == 0 ? 'none' : ($tagmidseq ? $tagmidseq : $data->{tagmidnum})) : '-').' 

'; + $html .= ''; + foreach my $pos (sort {$a <=> $b} keys %{$data->{freqs}->{5}}) { + $html .= ''; + } + $html .= ''; + foreach my $pos (sort {$a <=> $b} keys %{$data->{freqs}->{3}}) { + $html .= ''; + } + $html .= ''; + $html .= ''; + foreach my $num (1,0,0,0,5,0,0,0,0,10,0,0,0,0,15,0,0,0,0,20,0,20,0,0,0,0,15,0,0,0,0,10,0,0,0,0,5,0,0,0,1) { + $html .= ''; + } + $html .= ''; + $html .= '
'.&insert_image($FREQCHART_L,undef,undef,1).''; + foreach my $base (qw(A C G T N)) { + if($data->{freqs}->{5}->{$pos}->{$base}) { + $html .= &insert_image($BASE64_BASES->{$base},$data->{freqs}->{5}->{$pos}->{$base},14,1).'
'; + #''.$base.'
'; + } + } + $html .= &insert_image($MMCHART_B2,6,16,1).'
 ... '; + foreach my $base (qw(A C G T N)) { + if($data->{freqs}->{3}->{$pos}->{$base}) { + $html .= &insert_image($BASE64_BASES->{$base},$data->{freqs}->{3}->{$pos}->{$base},14,1).'
'; + } + } + $html .= &insert_image($MMCHART_B2,6,16,1).'
 '.($num ? $num : '').' 
 Position from Sequence Ends
'; + } + if(exists $data->{pairedend} && $data->{pairedend} && exists $data->{freqs2}) { + $html .= '


File 2
'; + $html .= '
5\'-end3\'-end
Probability of tag sequence:'.(exists $data->{tagprob2}->{5} ? $data->{tagprob2}->{5}.' %' : '-').''.(exists $data->{tagprob2}->{3} ? $data->{tagprob2}->{3}.' %' : '-').'

'; + $html .= ''; + foreach my $pos (sort {$a <=> $b} keys %{$data->{freqs2}->{5}}) { + $html .= ''; + } + $html .= ''; + foreach my $pos (sort {$a <=> $b} keys %{$data->{freqs2}->{3}}) { + $html .= ''; + } + $html .= ''; + $html .= ''; + foreach my $num (1,0,0,0,5,0,0,0,0,10,0,0,0,0,15,0,0,0,0,20,0,20,0,0,0,0,15,0,0,0,0,10,0,0,0,0,5,0,0,0,1) { + $html .= ''; + } + $html .= ''; + $html .= '
'.&insert_image($FREQCHART_L,undef,undef,1).''; + foreach my $base (qw(A C G T N)) { + if($data->{freqs2}->{5}->{$pos}->{$base}) { + $html .= &insert_image($BASE64_BASES->{$base},$data->{freqs2}->{5}->{$pos}->{$base},14,1).'
'; + #''.$base.'
'; + } + } + $html .= &insert_image($MMCHART_B2,6,16,1).'
 ... '; + foreach my $base (qw(A C G T N)) { + if($data->{freqs2}->{3}->{$pos}->{$base}) { + $html .= &insert_image($BASE64_BASES->{$base},$data->{freqs2}->{3}->{$pos}->{$base},14,1).'
'; + } + } + $html .= &insert_image($MMCHART_B2,6,16,1).'
 '.($num ? $num : '').' 
 Position from Sequence Ends
'; + } + if(exists $data->{freqs} || exists $data->{freqs2}) { + $html .= '

'; + } + + #Sequence duplicates + if(exists $data->{dubslength} || exists $data->{dubscounts}) { + $html .= '
Sequence Duplication
'; + } + my %dubs; + if(exists $data->{dubscounts} && keys %{$data->{dubscounts}}) { + my $exactonly = $data->{exactonly}||0; + foreach my $n (keys %{$data->{dubscounts}}) { + foreach my $s (keys %{$data->{dubscounts}->{$n}}) { + $dubs{$s}->{count} += $data->{dubscounts}->{$n}->{$s} * $n; + $dubs{$s}->{max} = $n unless(exists $dubs{$s}->{max} && $dubs{$s}->{max} > $n); + $dubs{all} += $data->{dubscounts}->{$n}->{$s} * $n; + } + } + $html .= '
'; + unless($exactonly) { + $html .= ''; + } + $html .= '
# Sequences Max duplicates
Exact duplicates:'.(exists $dubs{0}->{count} ? &addCommas($dubs{0}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{0}->{count}).' %)' : 0).''.($dubs{0}->{max}||0).'
Exact duplicates with reverse complements:'.(exists $dubs{3}->{count} ? &addCommas($dubs{3}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{3}->{count}).' %)' : 0).' '.($dubs{3}->{max}||0).'
5\' duplicates'.(exists $dubs{1}->{count} ? &addCommas($dubs{1}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{1}->{count}).' %)' : 0).' '.($dubs{1}->{max}||0).'
3\' duplicates'.(exists $dubs{2}->{count} ? &addCommas($dubs{2}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{2}->{count}).' %)' : 0).' '.($dubs{2}->{max}||0).'
5\'/3\' duplicates with reverse complements'.(exists $dubs{4}->{count} ? &addCommas($dubs{4}->{count}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{4}->{count}).' %)' : 0).' '.($dubs{4}->{max}||0).'
Total:'.(exists $dubs{all} ? &addCommas($dubs{all}).'  ('.sprintf("%.2f",100/$data->{numseqs}*$dubs{all}).' %)' : 0).'-
'; + } + if(exists $dubs{all} && $dubs{all}) { + if(exists $data->{dubslength} && keys %{$data->{dubslength}}) { + $surface = &createStackBarPlot(&convertOdToStackBinMatrix($data->{dubslength},5,1),'Sequence duplication level','Read Length in bp','Number of duplicates','',0,' bp'); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '
'.&insert_image($png); + } + if(exists $data->{dubscounts} && keys %{$data->{dubscounts}}) { + $surface = &createStackBarPlot(&convertOdToStackBinMatrix($data->{dubscounts},5,1,100),'Sequence duplication level','Number of duplicates','Number of sequences','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{dubslength}); + $html .= &insert_image($png); + my %dubsmax; + my $count = 1; + foreach my $n (sort {$b <=> $a} keys %{$data->{dubscounts}}) { + foreach my $s (keys %{$data->{dubscounts}->{$n}}) { + foreach my $i (1..$data->{dubscounts}->{$n}->{$s}) { + $dubsmax{$count++}->{$s} = $n; + last unless($count <= 100); + } + last unless($count <= 100); + } + last unless($count <= 100); + } + $surface = &createStackBarPlot(&convertOdToStackBinMatrix(\%dubsmax,5,1,100),'Sequence duplication level','Sequence','Number of duplicates','',0); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{dubslength}); + $html .= &insert_image($png); + } + } + if(exists $data->{dubslength} || exists $data->{dubscounts}) { + $html .= '

'; + } + + #Sequence complexity + if(exists $data->{compldust} || exists $data->{complentropy}) { + $html .= '
Sequence Complexity
'; + if(exists $data->{complvals}) { + my $complseq; + foreach my $d (keys %{$data->{complvals}}) { + foreach my $m ('minseq','maxseq') { + $complseq = $data->{complvals}->{$d}->{$m}; + $complseq = substr($complseq,0,797).'...' if(length($complseq) > 800); + $complseq =~ s/(.{60})/$1\
/g; + $data->{complvals}->{$d}->{$m} = $complseq; + } + } + } + $html .= '
ValueSequence
Minimum DUST score:'.(exists $data->{complvals}->{dust}->{minval} ? $data->{complvals}->{dust}->{minval} : '-').''.(exists $data->{complvals}->{dust}->{minseq} ? $data->{complvals}->{dust}->{minseq} : '').'
Maximum DUST score:'.(exists $data->{complvals}->{dust}->{maxval} ? $data->{complvals}->{dust}->{maxval} : '').''.(exists $data->{complvals}->{dust}->{maxseq} ? $data->{complvals}->{dust}->{maxseq} : '').'
Minimum Entropy value:'.(exists $data->{complvals}->{entropy}->{minval} ? $data->{complvals}->{entropy}->{minval} : '').''.(exists $data->{complvals}->{entropy}->{minseq} ? $data->{complvals}->{entropy}->{minseq} : '').'
Maximum Entropy value:'.(exists $data->{complvals}->{entropy}->{maxval} ? $data->{complvals}->{entropy}->{maxval} : '').''.(exists $data->{complvals}->{entropy}->{maxseq} ? $data->{complvals}->{entropy}->{maxseq} : '').'

'; + } + if(exists $data->{compldust}) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{compldust},0),undef,'Sequence complexity distribution','Mean sequence complexity (DUST scores)','Number of sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + } + if(exists $data->{complentropy}) { + $surface = &createAnnotBarPlot(&convertOdToBinMatrix($data->{complentropy},0),undef,'Sequence complexity distribution','Mean sequence complexity (Entropy values)','Number of sequences','',1); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

' if(exists $data->{compldust}); + $html .= &insert_image($png); + } + if(exists $data->{compldust} || exists $data->{complentropy}) { + $html .= '

'; + } + + #Dinucleotide odd ratio PCA - microbial/viral + if(exists $data->{dinucodds} && keys %{$data->{dinucodds}}) { + $html .= '
Dinucleotide Odds Ratios
'; + $html .= '
'; + foreach my $d (map {join("/",(m/../g ))} sort keys %{$data->{dinucodds}}) { + $html .= ''; + } + $html .= ''; + foreach my $d (map {sprintf("%.4f",$data->{dinucodds}->{$_})} sort keys %{$data->{dinucodds}}) { + $html .= ''; + } + $html .= '
 '.$d.'
Odds ratio'.$d.'

'; + my @new = map {$data->{dinucodds}->{$_}} sort keys %{$data->{dinucodds}}; + $surface = &createOddsRatioPlot($data->{dinucodds},'Odds ratios','Dinucleotide','Odds ratio',''); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= &insert_image($png); + $surface = &createPCAPlot(&convertToPCAValues(\@new,'m'),'PCA','1st Principal Component Score','2nd Principal Component Score',''); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

'; + $html .= &insert_image($png); + $surface = &createPCAPlot(&convertToPCAValues(\@new,'v'),'PCA','1st Principal Component Score','2nd Principal Component Score',''); + $png = ''; + $surface->write_to_png_stream(sub { my ($closure, $data) = @_; $png .= $data; }); + $html .= '

'; + $html .= &insert_image($png); + $html .= '
'; + } + + $html .= '
'; + $html .= &footer(); + + #write html to file + $file = &getFileName('.html'); + open(FH, ">$file") or &printError("Can't open file ".$file.": $!"); + print FH $html; + close(FH); + &printLog("Done with HTML data"); +} + +sub insert_image { + my ($data, $height, $width, $noencode) = @_; + my $content .= ''."\n"; + return $content; +} + +sub inline_image { + return "data:image/png;base64,".MIME::Base64::encode_base64($_[0]); +} + +sub convertIntToString { + my $int = shift; + $int =~ s/(.{2})/chr(hex($1))/eg; + return $int; +} diff -r 000000000000 -r 9790cfb46d03 prinseq-lite.pl --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/prinseq-lite.pl Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,4850 @@ +#!/usr/bin/perl + +#=============================================================================== +# Author: Robert SCHMIEDER, Computational Science Research Center @ SDSU, CA +# +# File: prinseq-lite +# Date: 2013-03-13 +# Version: 0.20.3 lite +# +# Usage: +# prinseq-lite [options] +# +# Try 'prinseq-lite -h' for more information. +# +# Purpose: PRINSEQ will help you to preprocess your genomic or metagenomic +# sequence data in FASTA or FASTQ format. The lite version does not +# require any non-core perl modules for processing. 
+# +# Bugs: Please use http://sourceforge.net/tracker/?group_id=315449 +# +#=============================================================================== + +use strict; +use warnings; + +#use Data::Dumper; ### +use Getopt::Long; +use Pod::Usage; +use File::Temp qw(tempfile); #for output files +use Fcntl qw(:flock SEEK_END); #for log file +use Digest::MD5 qw(md5_hex); #for dereplication +use Cwd; +use List::Util qw(sum min max); + +$| = 1; # Do not buffer output + +my $WINDOWSIZE = 64; +my $WINDOWSTEP = 32; +my $WORDSIZE = 3; +my @WINDOWSIZEARRAY = (0..61); +my $LOG62 = log(62); +my $ONEOVERLOG62 = 1/log(62); +my $POINTFIVE = 1/2; +my $LINE_WIDTH = 60; +my $TRIM_QUAL_WINDOW = 1; +my $TRIM_QUAL_STEP = 1; +my $TRIM_QUAL_TYPE = 'min'; +my $TRIM_QUAL_RULE = 'lt'; +my $TAG_LENGTH = 20; +my %MIDS = (ACGAGTGCGT => 0, + ACGCTCGACA => 0, + AGACGCACTC => 0, + AGCACTGTAG => 0, + ATCAGACACG => 0, + ATATCGCGAG => 0, + CGTGTCTCTA => 0, + CTCGCGTGTC => 0, + TAGTATCAGC => 0, + TCTCTATGCG => 0, + TGATACGTCT => 0, + TACTGAGCTA => 0, + CATAGTAGTG => 0, + CGAGAGATAC => 0, + ACACGACGACT => 0, + ACACGTAGTAT => 0, + ACACTACTCGT => 0, + ACGACACGTAT => 0, + ACGAGTAGACT => 0, + ACGCGTCTAGT => 0, + ACGTACACACT => 0, + ACGTACTGTGT => 0, + ACGTAGATCGT => 0, + ACTACGTCTCT => 0, + ACTATACGAGT => 0, + ACTCGCGTCGT => 0); +my $MIDCHECKLENGTH = 15; #maximum MID length plus possible key length (by default 4 bp for 454) +my %DN_DI = ('AA' => 0, 'AC' => 0, 'AG' => 0, 'AT' => 0, 'CA' => 0, 'CC' => 0, 'CG' => 0, 'CT' => 0, 'GA' => 0, 'GC' => 0, 'GG' => 0, 'GT' => 0, 'TA' => 0, 'TC' => 0, 'TG' => 0, 'TT' => 0); +my %GRAPH_OPTIONS = map {$_ => 1} qw(ld gc qd ns pt ts aq de da sc dn); +my $VERSION = '0.20.3'; +my $WHAT = 'lite'; + +my $man = 0; +my $help = 0; +my %params = ('help' => \$help, 'h' => \$help, 'man' => \$man); +GetOptions( \%params, + 'help|h', + 'man', + 'verbose', + 'version' => sub { print "PRINSEQ-$WHAT $VERSION\n"; exit; }, + 'fastq=s', + 'fasta=s', + 'fastq2=s', + 'fasta2=s', + 'qual=s', 
+ 'min_len=i', + 'max_len=i', + 'range_len=s', + 'min_gc=i', + 'max_gc=i', + 'range_gc=s', + 'min_qual_score=i', + 'max_qual_score=i', + 'min_qual_mean=i', + 'max_qual_mean=i', + 'ns_max_p=i', + 'ns_max_n=i', + 'noniupac', + 'seq_num=i', + 'derep=i', + 'derep_min=i', + 'lc_method=s', + 'lc_threshold=i', + 'trim_to_len=i', + 'trim_left=i', + 'trim_right=i', + 'trim_left_p=i', + 'trim_right_p=i', + 'trim_tail_left=i', + 'trim_tail_right=i', + 'trim_ns_left=i', + 'trim_ns_right=i', + 'trim_qual_left=i', + 'trim_qual_right=i', + 'trim_qual_type=s', + 'trim_qual_rule=s', + 'trim_qual_window=i', + 'trim_qual_step=i', + 'seq_case=s', + 'dna_rna=s', + 'line_width=i', + 'rm_header', + 'seq_id=s', + 'seq_id_mappings:s', + 'out_format=i', + 'out_good=s', + 'out_bad=s', + 'stats_len', + 'stats_dinuc', + 'stats_info', + 'stats_tag', + 'stats_dupl', + 'stats_ns', + 'stats_assembly', + 'stats_all', + 'aa', + 'log:s', + 'graph_data:s', + 'graph_stats=s', + 'phred64', + 'qual_noscale', + 'no_qual_header', + 'exact_only', + 'web:s', + 'filename1=s', + 'filename2=s', + 'custom_params=s', + 'params=s' + ) or pod2usage(2); +pod2usage(1) if $help; +pod2usage(-exitstatus => 0, -verbose => 2) if $man; + +=head1 NAME + +PRINSEQ - PReprocessing and INformation of SEQuence data + +=head1 VERSION + +PRINSEQ-lite 0.20.3 + +=head1 SYNOPSIS + +perl prinseq-lite.pl [-h] [-help] [-version] [-man] [-verbose] [-fastq input_fastq_file] [-fasta input_fasta_file] [-fastq2 input_fastq_file_2] [-fasta2 input_fasta_file_2] [-qual input_quality_file] [-min_len int_value] [-max_len int_value] [-range_len ranges] [-min_gc int_value] [-max_gc int_value] [-range_gc ranges] [-min_qual_score int_value] [-max_qual_score int_value] [-min_qual_mean int_value] [-max_qual_mean int_value] [-ns_max_p int_value] [-ns_max_n int_value] [-noniupac] [-seq_num int_value] [-derep int_value] [-derep_min int_value] [-lc_method method_name] [-lc_threshold int_value] [-trim_to_len int_value] [-trim_left int_value] [-trim_right 
int_value] [-trim_left_p int_value] [-trim_right_p int_value] [-trim_ns_left int_value] [-trim_ns_right int_value] [-trim_tail_left int_value] [-trim_tail_right int_value] [-trim_qual_left int_value] [-trim_qual_right int_value] [-trim_qual_type type] [-trim_qual_rule rule] [-trim_qual_window int_value] [-trim_qual_step int_value] [-seq_case case] [-dna_rna type] [-line_width int_value] [-rm_header] [-seq_id id_string] [-out_format int_value] [-out_good filename_prefix] [-out_bad filename_prefix] [-phred64] [-stats_info] [-stats_len] [-stats_dinuc] [-stats_tag] [-stats_dupl] [-stats_ns] [-stats_assembly] [-stats_all] [-aa] [-graph_data file] [-graph_stats string] [-qual_noscale] [-no_qual_header] [-exact_only] [-log file] [-custom_params string] [-params file] [-seq_id_mappings file] + +=head1 DESCRIPTION + +PRINSEQ will help you to preprocess your genomic or metagenomic sequence data in FASTA (and QUAL) or FASTQ format. The lite version does not require any non-core perl modules for processing. + +=head1 OPTIONS + +=over 8 + +=item B<-help> | B<-h> + +Print the help message; ignore other arguments. + +=item B<-man> + +Print the full documentation; ignore other arguments. + +=item B<-version> + +Print program version; ignore other arguments. + +=item B<-verbose> + +Print status and info messages during processing. + +=item B<***** INPUT OPTIONS *****> + +=item B<-fastq> + +Input file in FASTQ format that contains the sequence and quality data. Use stdin instead of a file name to read from STDIN (-fastq stdin). This can be useful to process compressed files using Unix pipes. + +=item B<-fasta> + +Input file in FASTA format that contains the sequence data. Use stdin instead of a file name to read from STDIN (-fasta stdin). This can be useful to process compressed files using Unix pipes. + +=item B<-qual> + +Input file in QUAL format that contains the quality data. + +=item B<-fastq2> + +For paired-end data only. 
Input file in FASTQ format that contains the sequence and quality data. The sequence identifiers for two matching paired-end sequences in separate files can be marked by /1 and /2, or _L and _R, or _left and _right, or must have the exact same identifier in both input files. The input sequences must be sorted by their sequence identifiers. Singletons are allowed in the input files. + +=item B<-fasta2> + +For paired-end data only. Input file in FASTA format that contains the sequence data. The sequence identifiers for two matching paired-end sequences in separate files can be marked by /1 and /2, or _L and _R, or _left and _right, or must have the exact same identifier in both input files. The input sequences must be sorted by their sequence identifiers. Singletons are allowed in the input files. + +=item B<-params> + +Input file in text format that contains PRINSEQ parameters. Each parameter should be specified on a new line and arguments should be separated by spaces or tabs. Comments can be specified on lines starting with the # sign. Can be combined with command line parameters. Parameters specified on the command line will override the arguments in the file (if any). + +=item B<-si13> + +This option was replaced by option -phred64. + +=item B<-phred64> + +Quality data in FASTQ file is in Phred+64 format (http://en.wikipedia.org/wiki/FASTQ_format#Encoding). Not required for Illumina 1.8+, Sanger, Roche/454, Ion Torrent, PacBio data. + +=item B<-aa> + +Input is amino acid (protein) sequences instead of nucleic acid (DNA or RNA) sequences. Allowed amino acid characters: ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmnopqrstuvwyzx*- and allowed nucleic acid characters: ACGTURYKMSWBDHVNXacgturykmswbdhvnx- + +The following options are ignored for -aa: stats_dinuc,stats_tag,stats_ns,dna_rna + +=item B<***** OUTPUT OPTIONS *****> + +=item B<-out_format> + +To change the output format, use one of the following options. 
If not defined, the output format will be the same as the input format. + +1 (FASTA only), 2 (FASTA and QUAL), 3 (FASTQ), 4 (FASTQ and FASTA), or 5 (FASTQ, FASTA and QUAL) + +=item B<-out_good> + +By default, the output files are created in the same directory as the input file containing the sequence data with an additional "_prinseq_good_XXXX" in their name (where XXXX is replaced by random characters to prevent overwriting previous files). To change the output filename and location, specify the filename using this option. The file extension will be added automatically (either .fasta, .qual, or .fastq). For paired-end data, filenames contain additionally "_1", "_1_singletons", "_2", and "_2_singletons" before the file extension. Use "-out_good null" to prevent the program from generating the output file(s) for data passing all filters. Use "-out_good stdout" to write data passing all filters to STDOUT (only for FASTA or FASTQ output files). + +Example: use "file_passed" to generate the output file file_passed.fasta in the current directory + +=item B<-out_bad> + +By default, the output files are created in the same directory as the input file containing the sequence data with an additional "_prinseq_bad_XXXX" in their name (where XXXX is replaced by random characters to prevent overwriting previous files). To change the output filename and location, specify the filename using this option. The file extension will be added automatically (either .fasta, .qual, or .fastq). For paired-end data, filenames contain additionally "_1" and "_2" before the file extension. Use "-out_bad null" to prevent the program from generating the output file(s) for data not passing any filter. Use "-out_bad stdout" to write data not passing any filter to STDOUT (only for FASTA or FASTQ output files). 
+ +Example: use "file_filtered" to generate the output file file_filtered.fasta in the current directory + +Example: "-out_good stdout -out_bad null" will write data passing filters to STDOUT and data not passing any filter will be ignored + +=item B<-log> + +Log file to keep track of parameters, errors, etc. The log file name is optional. If no file name is given, the log file name will be "inputname.log". If the log file already exists, new content will be added to the file. + +=item B<-graph_data> + +File that contains the necessary information to generate the graphs similar to the ones in the web version. The file name is optional. If no file name is given, the file name will be "inputname.gd". If the file already exists, new content will overwrite the file. Use "-out_good null -out_bad null" to prevent generating any additional outputs. (See below for more options related to the graph data.) + +The graph data can be used as input for the prinseq-graphs.pl file to generate the PNG graph files or an HTML report file. If you have trouble installing the required prinseq-graphs.pl modules or want to see an output +example report, upload the graph data file at: http://edwards.sdsu.edu/prinseq/ -> Choose "Get Report" + +=item B<-graph_stats> + +Use this option to select what statistics should be calculated and included in the graph_data file. This is useful if you e.g. do not need sequence complexity information, which requires a lot of computation. Requires to have graph_data specified. Default is all selected. 
+ +Allowed options are (separate multiple values by commas with no spaces): ld (Length distribution), gc (GC content distribution), qd (Base quality distribution), ns (Occurrence of N), pt (Poly-A/T tails), ts (Tag sequence check), aq (Assembly quality measure), de (Sequence duplication - exact only), da (Sequence duplication - exact + 5'/3'), sc (Sequence complexity), dn (Dinucleotide odds ratios, includes the PCA plots) + +Example use: -graph_stats ld,gc,qd,de + +=item B<-qual_noscale> + +Use this option if all your sequences are shorter than 100bp, as the quality data then does not need to be scaled to 100 data points in the graph. By default, quality scores of sequences shorter than 100bp or longer than 100bp are fit to 100 data points. (Retrieving this information and calculating the graph data would otherwise require parsing the data twice or storing all the quality data in memory.) + +=item B<-no_qual_header> + +In order to reduce the file size, this option will generate an empty header line for the quality data in FASTQ files. Instead of +header, only the + sign will be output. The header of the sequence data will be left unchanged. This option applies to FASTQ output files only. + +=item B<-exact_only> + +Use this option to check for exact (forward and reverse) duplicates only when generating the graph data. This keeps the memory requirements low for large input files and is faster. This option will automatically be applied when using -derep options 1 and/or 4 only. Specify option -derep 1 or -derep 4 if you do not want to apply both at the same time. + +=item B<-seq_id_mappings> + +Text file containing the old and new (specified with -seq_id) identifiers for later reference. This option is useful if e.g. a renamed sequence has to be identified based on the new sequence identifier. The file name is optional. If no file name is given, the file name will be "inputname_prinseq_good.ids" (only good sequences are renamed). 
If a file with the same name already exists, new content will overwrite the old file. The text file contains one sequence identifier pair per line, separated by tabs (old-tab-new). Requires option -seq_id. + + +=item B<***** FILTER OPTIONS *****> + +=item B<-min_len> + +Filter sequence shorter than min_len. + +=item B<-max_len> + +Filter sequence longer than max_len. + +=item B<-range_len> + +Filter sequence by length range. Multiple range values should be separated by comma without spaces. + +Example: -range_len 50-100,250-300 + +=item B<-min_gc> + +Filter sequence with GC content below min_gc. + +=item B<-max_gc> + +Filter sequence with GC content above max_gc. + +=item B<-range_gc> + +Filter sequence by GC content range. Multiple range values should be separated by comma without spaces. + +Example: -range_gc 50-60,75-90 + +=item B<-min_qual_score> + +Filter sequence with at least one quality score below min_qual_score. + +=item B<-max_qual_score> + +Filter sequence with at least one quality score above max_qual_score. + +=item B<-min_qual_mean> + +Filter sequence with quality score mean below min_qual_mean. + +=item B<-max_qual_mean> + +Filter sequence with quality score mean above max_qual_mean. + +=item B<-ns_max_p> + +Filter sequence with more than ns_max_p percentage of Ns. + +=item B<-ns_max_n> + +Filter sequence with more than ns_max_n Ns. + +=item B<-noniupac> + +Filter sequence with characters other than A, C, G, T or N. + +=item B<-seq_num> + +Only keep the first seq_num number of sequences (that pass all other filters). + +=item B<-derep> + +Type of duplicates to filter. Allowed values are 1, 2, 3, 4 and 5. Use integers for multiple selections (e.g. 124 to use type 1, 2 and 4). The order does not matter. Option 2 and 3 will set 1 and option 5 will set 4 as these are subsets of the other option. 
+ +1 (exact duplicate), 2 (5' duplicate), 3 (3' duplicate), 4 (reverse complement exact duplicate), 5 (reverse complement 5'/3' duplicate) + +=item B<-derep_min> + +This option specifies the number of allowed duplicates. If you want to remove sequence duplicates that occur more than x times, then you would specify x+1 as the -derep_min value. For example, to remove sequences that occur more than 5 times, you would specify -derep_min 6. This option can only be used in combination with -derep 1 and/or 4 (forward and/or reverse exact duplicates). [default: 2] + +=item B<-lc_method> + +Method to filter low complexity sequences. The current options are "dust" and "entropy". Use "-lc_method dust" to calculate the complexity using the dust method. + +=item B<-lc_threshold> + +The threshold value (between 0 and 100) used to filter sequences by sequence complexity. The dust method uses this as maximum allowed score and the entropy method as minimum allowed value. + +=item B<-custom_params> + +Can be used to specify additional filters. The current set of possible rules is limited and has to follow the specifications below. The custom parameters have to be specified within quotes (either ' or "). + +Please separate parameter values with a space and separate new parameter sets with a semicolon (;). Parameters are defined by two values: + (1) the pattern (any combination of the letters "ACGTN"), + (2) the number of repeats or percentage of occurrence +Percentage values are defined by a number followed by the %-sign (without space). +If no %-sign is given, it is assumed that the given number specifies the number of repeats of the pattern. 
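The rule format described above can be sketched in Perl as follows. This is a minimal illustration, not the PRINSEQ implementation: the names parse_custom_params and is_filtered are hypothetical, and the way a percentage is computed for multi-letter patterns (pattern occurrences times pattern length, relative to sequence length) is an assumption made for this sketch.

```perl
use strict;
use warnings;

# Parse a -custom_params string into rules of the form
# [is_repeat, pattern, number]. Malformed rules are skipped silently.
# (Hypothetical helper, not part of PRINSEQ.)
sub parse_custom_params {
    my ($spec) = @_;
    my @rules;
    foreach my $rule (split /\s*;\s*/, $spec) {
        my @parts = split ' ', $rule;           # pattern and value
        next unless @parts == 2;
        my ($pattern, $value) = @parts;
        next unless $pattern =~ /^[ACGTN]+$/;   # only the letters ACGTN
        my $is_repeat = 1;
        $is_repeat = 0 if $value =~ s/%//g;     # a %-sign marks a percentage rule
        next unless $value =~ /^\d+$/;
        push @rules, [$is_repeat, $pattern, $value];
    }
    return @rules;
}

# Return 1 if the sequence matches any rule (and would be filtered out).
# ASSUMPTION: percentage = occurrences * pattern length / sequence length.
sub is_filtered {
    my ($seq, @rules) = @_;
    return 0 unless length $seq;
    foreach my $r (@rules) {
        my ($is_repeat, $pattern, $num) = @$r;
        if ($is_repeat) {
            # pattern repeated $num times anywhere in the sequence
            return 1 if index($seq, $pattern x $num) != -1;
        } else {
            my $count = () = $seq =~ /\Q$pattern\E/g;
            my $percent = 100 * $count * length($pattern) / length($seq);
            return 1 if $percent > $num;
        }
    }
    return 0;
}
```

With the rule set "AAT 10;T 70%;A 15", a sequence made of ten consecutive AAT repeats or a sequence such as TTTTA (80% T) would be reported as filtered under these assumptions, while ACGT would pass.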
+ +Examples: "AAT 10" (filters out sequences containing AATAATAATAATAATAATAATAATAATAAT anywhere in the sequence), "T 70%" (filters out sequences with more than 70% Ts in the sequence), "A 15" (filters out sequences containing AAAAAAAAAAAAAAA anywhere in the sequence), "AAT 10;T 70%;A 15" (apply all three filters) + +=item B<***** TRIM OPTIONS *****> + +=item B<-trim_to_len> + +Trim all sequence from the 3'-end to result in sequence with this length. + +=item B<-trim_left> + +Trim sequence at the 5'-end by trim_left positions. + +=item B<-trim_right> + +Trim sequence at the 3'-end by trim_right positions. + +=item B<-trim_left_p> + +Trim sequence at the 5'-end by trim_left_p percentage of read length. The trim length is rounded towards the lower integer (e.g. 143.6 is rounded to 143 positions). Use an integer between 1 and 100 for the percentage value. + +=item B<-trim_right_p> + +Trim sequence at the 3'-end by trim_right_p percentage of read length. The trim length is rounded towards the lower integer (e.g. 143.6 is rounded to 143 positions). Use an integer between 1 and 100 for the percentage value. + +=item B<-trim_tail_left> + +Trim poly-A/T tail with a minimum length of trim_tail_left at the 5'-end. + +=item B<-trim_tail_right> + +Trim poly-A/T tail with a minimum length of trim_tail_right at the 3'-end. + +=item B<-trim_ns_left> + +Trim poly-N tail with a minimum length of trim_ns_left at the 5'-end. + +=item B<-trim_ns_right> + +Trim poly-N tail with a minimum length of trim_ns_right at the 3'-end. + +=item B<-trim_qual_left> + +Trim sequence by quality score from the 5'-end with this threshold score. + +=item B<-trim_qual_right> + +Trim sequence by quality score from the 3'-end with this threshold score. + +=item B<-trim_qual_type> + +Type of quality score calculation to use. Allowed options are min, mean, max and sum. [default: min] + +=item B<-trim_qual_rule> + +Rule to use to compare quality score to calculated value. 
Allowed options are lt (less than), gt (greater than) and et (equal to). [default: lt] + +=item B<-trim_qual_window> + +The sliding window size used to calculate quality score by type. To stop at the first base that fails the rule defined, use a window size of 1. [default: 1] + +=item B<-trim_qual_step> + +Step size used to move the sliding window. To move the window over all quality scores without missing any, the step size should be less than or equal to the window size. [default: 1] + +=item B<***** REFORMAT OPTIONS *****> + +=item B<-seq_case> + +Changes sequence character case to upper or lower case. Allowed options are "upper" and "lower". Use this option to remove soft-masking from your sequences. + +=item B<-dna_rna> + +Convert sequence between DNA and RNA. Allowed options are "dna" (convert from RNA to DNA) and "rna" (convert from DNA to RNA). + +=item B<-line_width> + +Sequence characters per line. Use 0 if you want each sequence in a single line. Use 80 for line breaks every 80 characters. Note that this option only applies to FASTA output files, since FASTQ files store sequences without additional line breaks. [default: 60] + +=item B<-rm_header> + +Remove the sequence header. This includes everything after the sequence identifier (which is kept unchanged). + +=item B<-seq_id> + +Rename the sequence identifier. A counter is added to each identifier to ensure its uniqueness. Use option -seq_id_mappings to generate a file containing the old and new identifiers for later reference. + +Example: "mySeq_10" will generate the IDs (in FASTA format) >mySeq_101, >mySeq_102, >mySeq_103, ... + +=item B<***** SUMMARY STATISTIC OPTIONS *****> + +The summary statistic values are written to STDOUT in the form: "parameter_name statistic_name value" (without the quotes). For example, "stats_info reads 10000" or "stats_len max 500". Only one statistic is written per line and values are separated by tabs. + +If you specify any statistic option, no other output will be generated. 
To preprocess data, do not specify a statistics option. + +=item B<-stats_info> + +Outputs basic information such as number of reads (reads) and total bases (bases). + +=item B<-stats_len> + +Outputs minimum (min), maximum (max), range (range), mean (mean), standard deviation (stddev), mode (mode) and mode value (modeval), and median (median) for read length. + +=item B<-stats_dinuc> + +Outputs the dinucleotide odds ratio for AA/TT (aatt), AC/GT (acgt), AG/CT (agct), AT (at), CA/TG (catg), CC/GG (ccgg), CG (cg), GA/TC (gatc), GC (gc) and TA (ta). + +=item B<-stats_tag> + +Outputs the probability of a tag sequence at the 5'-end (prob5) and 3'-end (prob3) in percentage (0..100). Provides the number of predefined MIDs (midnum) and the MID sequences (midseq, separated by comma, only provided if midnum > 0) that occur in more than 34/100 (approx. 3%) of the reads. + +=item B<-stats_dupl> + +Outputs the number of exact duplicates (exact), 5' duplicates (5), 3' duplicates (3), exact duplicates with reverse complements (exactrevcom) and 5'/3' duplicates with reverse complements (revcomp), and total number of duplicates (total). The maximum number of duplicates is given under the value name with an additional "maxd" (e.g. exactmaxd or 5maxd). + +=item B<-stats_ns> + +Outputs the number of reads with ambiguous base N (seqswithn), the maximum number of Ns per read (maxn) and the maximum percentage of Ns per read (maxp). The maxn and maxp values are not necessarily from the same sequence. + +=item B<-stats_assembly> + +Outputs the N50, N90, etc. contig sizes. The Nxx contig size is a weighted median that is defined as the length of the smallest contig C in the sorted list of all contigs where the cumulative length from the largest contig to contig C is at least xx% of the total length (sum of contig lengths). + +=item B<-stats_all> + +Outputs all available summary statistics. 
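The Nxx contig size defined under -stats_assembly can be illustrated with a short Perl sketch. It follows the definition given there (sort contigs from largest to smallest and stop at the first contig where the cumulative length reaches xx% of the total); the function name nxx is hypothetical and this is not the code PRINSEQ itself uses.

```perl
use strict;
use warnings;

# Return the Nxx contig size for a list of contig lengths,
# or undef (in scalar context) for an empty list.
# (Hypothetical helper, not part of PRINSEQ.)
sub nxx {
    my ($xx, @lengths) = @_;
    return unless @lengths;

    my $total = 0;
    $total += $_ for @lengths;

    my $cumulative = 0;
    foreach my $len (sort { $b <=> $a } @lengths) {   # largest contig first
        $cumulative += $len;
        # integer comparison avoids floating point rounding of xx% of total
        return $len if $cumulative * 100 >= $total * $xx;
    }
    return;
}

my @contigs = (2, 3, 4, 5, 6, 7, 8, 9, 10);
print "N50\t", nxx(50, @contigs), "\n";
print "N90\t", nxx(90, @contigs), "\n";
```

For the nine contigs above the total length is 54, so N50 is reached at the contig of length 8 (10 + 9 + 8 = 27) and N90 at the contig of length 4 (cumulative length 49, which is at least 48.6).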
+ +=item B<***** ORDER OF PROCESSING *****> + +The available options are processed in the following order: + +seq_num, trim_left, trim_right, trim_left_p, trim_right_p, trim_qual_left, trim_qual_right, trim_tail_left, trim_tail_right, trim_ns_left, trim_ns_right, trim_to_len, min_len, max_len, range_len, min_qual_score, max_qual_score, min_qual_mean, max_qual_mean, min_gc, max_gc, range_gc, ns_max_p, ns_max_n, noniupac, lc_method, derep, seq_id, seq_case, dna_rna, out_format + +=back + +=head1 AUTHOR + +Robert SCHMIEDER, C<< >> + +=head1 BUGS + +If you find a bug please email me at C<< >> or use http://sourceforge.net/tracker/?group_id=315449 so that I can make PRINSEQ better. + +=head1 COPYRIGHT + +Copyright (C) 2010-2012 Robert SCHMIEDER + +=head1 LICENSE + +This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. + +This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. + +You should have received a copy of the GNU General Public License along with this program. If not, see . 
+ +=cut + +# +################################################################################ +## DATA AND PARAMETER CHECKING +################################################################################ +# + +my ($file1,$file2,$command,@dataread,$aa,%filtercount,$slashnum,$trimnum1,$trimnum2); + +#check if params file +if(exists $params{params}) { + my $ps = &readParamsFile($params{params}); + foreach my $p (keys %$ps) { + next if(exists $params{$p}); + $params{$p} = $ps->{$p}; + } +} + +#check if amino acid or nucleic acid input +if(exists $params{aa}) { + $aa = 1; + $command .= ' -aa'; +} else { + $aa = 0; +} + +#Check if input file exists and check if file format is correct +if(exists $params{fasta} && exists $params{fastq}) { + &printError('fasta and fastq cannot be used together'); +} elsif(exists $params{fasta}) { + $command .= ' -fasta '.$params{fasta}; + $file1 = $params{fasta}; + if($params{fasta} eq 'stdin') { + if(exists $params{qual} && $params{qual} eq 'stdin') { + &printError('input from STDIN is only allowed for either of the input files'); + } else { + my $format = &checkInputFormat(); + unless($format eq 'fasta') { + &printError('input data for -fasta is in '.uc($format).' format not in FASTA format'); + } + } + } elsif(-e $params{fasta}) { + #check for file format + my $format = &checkFileFormat($file1); + unless($format eq 'fasta') { + &printError('input file for -fasta is in '.uc($format).' format not in FASTA format'); + } + } else { + &printError("could not find input file \"".$params{fasta}."\""); + } +} elsif(exists $params{fastq}) { + $command .= ' -fastq '.$params{fastq}; + $file1 = $params{fastq}; + if($params{fastq} eq 'stdin') { + my $format = &checkInputFormat(); + unless($format eq 'fastq') { + &printError('input data for -fastq is in '.uc($format).' 
format not in FASTQ format'); + } + } elsif(-e $params{fastq}) { + #check for file format + my $format = &checkFileFormat($file1); + unless($format eq 'fastq') { + &printError('input file for -fastq is in '.uc($format).' format not in FASTQ format'); + } + } else { + &printError("could not find input file \"".$params{fastq}."\""); + } +} else { + &printError("you did not specify an input file containing the query sequences"); +} +if(exists $params{fastq} && exists $params{qual}) { + &printError('fastq and qual cannot be used together'); +} elsif(exists $params{qual}) { + $command .= ' -qual '.$params{qual}; + if($params{qual} eq 'stdin') { + &printError('QUAL data cannot be read from STDIN'); + } elsif(-e $params{qual}) { + #check for file format + my $format = &checkFileFormat($params{qual}); + unless($format eq 'qual') { + &printError('input file for -qual is in '.uc($format).' format not in QUAL format'); + } + } else { + &printError("could not find input file \"".$params{qual}."\""); + } +} +if(exists $params{fasta2} && exists $params{fastq2}) { + &printError('fasta2 and fastq2 cannot be used together'); +} elsif(exists $params{fasta2}) { + if(!exists $params{fasta}) { + &printError('option fasta2 requires option fasta'); + } elsif($params{fasta} eq $params{fasta2}) { + &printError('option fasta and fasta2 cannot be the same input file'); + } else { + $command .= ' -fasta2 '.$params{fasta2}; + $file2 = $params{fasta2}; + if($params{fasta} eq 'stdin' || $params{fasta2} eq 'stdin') { + &printError('paired-end data cannot be processed from STDIN'); + } elsif(-e $params{fasta2}) { + #check for file format + my $format = &checkFileFormat($file2); + unless($format eq 'fasta') { + &printError('input file for -fasta2 is in '.uc($format).' 
format not in FASTA format'); + } + } else { + &printError("could not find input file \"".$params{fasta2}."\""); + } + } + ($slashnum,$trimnum1,$trimnum2) = &checkSlashnum($file2); +} elsif(exists $params{fastq2}) { + if(!exists $params{fastq}) { + &printError('option fastq2 requires option fastq'); + } elsif($params{fastq} eq $params{fastq2}) { + &printError('option fastq and fastq2 cannot be the same input file'); + } else { + $command .= ' -fastq2 '.$params{fastq2}; + $file2 = $params{fastq2}; + if($params{fastq} eq 'stdin' || $params{fastq2} eq 'stdin') { + &printError('paired-end data cannot be processed from STDIN'); + } elsif(-e $params{fastq2}) { + #check for file format + my $format = &checkFileFormat($file2); + unless($format eq 'fastq') { + &printError('input file for -fastq2 is in '.uc($format).' format not in FASTQ format'); + } + } else { + &printError("could not find input file \"".$params{fastq2}."\""); + } + } + ($slashnum,$trimnum1,$trimnum2) = &checkSlashnum($file2); +} + + +#check if stats_all +if(exists $params{stats_all}) { + $params{stats_info} = 1; + $params{stats_len} = 1; + $params{stats_dupl} = 1 unless($file2); + $params{stats_dinuc} = 1; + $params{stats_tag} = 1; + $params{stats_ns} = 1; + $params{stats_assembly} = 1; + delete($params{stats_all}); +} +if($aa) { + delete($params{stats_dinuc}); + delete($params{stats_tag}); + delete($params{stats_ns}); +} +if($file2) { + delete($params{stats_dupl}); + delete($params{stats_assembly}); +} + +#check if anything todo +unless( exists $params{min_len} || + exists $params{max_len} || + exists $params{range_len} || + exists $params{min_gc} || + exists $params{max_gc} || + exists $params{range_gc} || + exists $params{min_qual_score} || + exists $params{max_qual_score} || + exists $params{min_qual_mean} || + exists $params{max_qual_mean} || + exists $params{ns_max_p} || + exists $params{ns_max_n} || + exists $params{noniupac} || + exists $params{seq_num} || + exists $params{derep} || + exists 
$params{lc_method} || + exists $params{lc_threshold} || + exists $params{trim_to_len} || + exists $params{trim_left} || + exists $params{trim_right} || + exists $params{trim_left_p} || + exists $params{trim_right_p} || + exists $params{trim_tail_left} || + exists $params{trim_tail_right} || + exists $params{trim_ns_left} || + exists $params{trim_ns_right} || + exists $params{trim_qual_left} || + exists $params{trim_qual_right} || + exists $params{trim_qual_type} || + exists $params{trim_qual_rule} || + exists $params{trim_qual_window} || + exists $params{trim_qual_step} || + exists $params{seq_case} || + exists $params{dna_rna} || + exists $params{exact_only} || + exists $params{line_width} || + exists $params{rm_header} || + exists $params{seq_id} || + exists $params{out_format} || + exists $params{stats_info} || + exists $params{stats_len} || + exists $params{stats_dinuc} || + exists $params{stats_tag} || + exists $params{stats_dupl} || + exists $params{stats_ns} || + exists $params{stats_assembly} || + exists $params{phred64} || + exists $params{no_qual_header} || + exists $params{graph_data} || + exists $params{custom_params} + ) { + &printError('nothing to do with input data'); +} +#prevent out of files for stats +if(exists $params{stats_info} || exists $params{stats_len} || exists $params{stats_dinuc} || exists $params{stats_tag} || exists $params{stats_dupl} || exists $params{stats_ns} || exists $params{stats_assembly}) { + $params{out_good} = 'null'; + $params{out_bad} = 'null'; + $params{stats} = 1; +} elsif(exists $params{out_good} && $params{out_good} eq 'null' && exists $params{out_bad} && $params{out_bad} eq 'null' && !exists $params{graph_data}) { + &printError('no output selected (both set to null)'); +} + +#check if FASTQ file is given for option phred64 +if(exists $params{phred64}) { + $command .= ' -phred64'; + unless(exists $params{fastq}) { + &printError('option -phred64 can only be used for FASTQ input files'); + } +} + +#check if output format 
is possible +if(exists $params{out_format}) { + $command .= ' -out_format '.$params{out_format}; + if($params{out_format} =~ /\D/) { + &printError('output format option has to be an integer value'); + } elsif($params{out_format} == 2 || $params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5) { + unless(exists $params{fastq} || exists $params{qual}) { + &printError('cannot use this output format option without providing quality data as input'); + } + if(exists $params{fastq2} && ($params{out_format} == 2 || $params{out_format} == 4 || $params{out_format} == 5)) { + &printError('cannot use this output format option for paired-end input data. Only values 1 and 3 are allowed'); + } + } elsif($params{out_format} != 1) { + &printError('output format option not available'); + } +} else { + if(exists $params{fastq}) { + $params{out_format} = 3; + } elsif(exists $params{fasta} && exists $params{qual}) { + $params{out_format} = 2; + } else { + $params{out_format} = 1; + } +} +if(exists $params{no_qual_header} && $params{out_format} != 3) { + &printError('the option -no_qual_header can only be used for FASTQ outputs'); +} + +#check if output names are different +if(exists $params{out_good} && exists $params{out_bad} && $params{out_good} eq $params{out_bad} && $params{out_good} ne 'null' && $params{out_good} ne 'stdout') { + &printError('the output names for -out_good and -out_bad have to be different'); +} +#check if output can be written to standard output +if(($params{out_format} == 2 || $params{out_format} == 4 || $params{out_format} == 5) && ((exists $params{out_good} && $params{out_good} eq 'stdout') || (exists $params{out_bad} && $params{out_bad} eq 'stdout'))) { + &printError('the output cannot be written to STDOUT for multiple output files. 
This option can only be used for FASTA (-out_format 1) or FASTQ output (-out_format 3)'); +} + +#check dereplication option +#1 - exact dup, 2 - prefix, 3 - suffix, 4 - revcomp exact, 5 - revcomp prefix/suffix +my $derep = 0; +my %dereptypes; +my $derepmin = 2; +if(exists $params{derep}) { + $command .= ' -derep '.$params{derep}; + if($params{derep} < 0 || $params{derep} > 54321) { + &printError('invalid option for dereplication'); + } else { + my @tmp = split('',$params{derep}); + foreach(@tmp) { + if($_ < 1 || $_ > 5) { + &printError('invalid option '.$_.' for dereplication'); + } else { + $derep = 1; + $dereptypes{($_-1)} = 0; + } + } + } +} +if(!exists $dereptypes{0} && (exists $dereptypes{1} || exists $dereptypes{2})) { + $dereptypes{0} = 0; +} +if(!exists $dereptypes{3} && exists $dereptypes{4}) { + $dereptypes{3} = 0; +} +my $exactonly = 0; +if(exists $params{exact_only}) { + $command .= ' -exact_only'; + if(exists $dereptypes{1} || exists $dereptypes{2} || exists $dereptypes{4}) { + &printError('option -exact_only can only be used with -derep options 1 and/or 4'); + } + if(!exists $params{graph_data}) { + &printError('option -exact_only requires option -graph_data'); + } + if(!exists $params{derep}) { + $dereptypes{0} = 1; + $dereptypes{3} = 1; + } + $exactonly = 1; +} elsif((exists $dereptypes{0} || exists $dereptypes{3}) && !exists $dereptypes{1} && !exists $dereptypes{2} && !exists $dereptypes{4}) { + $command .= ' -exact_only'; + $exactonly = 1; + $params{exact_only} = 1; +} +if(exists $params{derep_min}) { + $command .= ' -derep_min '.$params{derep_min}; + if($params{derep_min} < 2) { + &printError('invalid option '.$params{derep_min}.' for derep_min. 
The values has to be greater than 1'); + } elsif(exists $dereptypes{1} || exists $dereptypes{2} || exists $dereptypes{4}) { + &printError('option -derep_min can only be used with -derep options 1 and/or 4'); + } else { + $derepmin = $params{derep_min}; + } +} + +#check for low complexity method +my $complval; +if(exists $params{lc_method}) { + $command .= ' -lc_method '.$params{lc_method}; + unless($params{lc_method} eq 'dust' || $params{lc_method} eq 'entropy') { + &printError('invalid low complexity method'); + } + unless(exists $params{lc_threshold}) { + &printError('the low complexity method requires a threshold value specified by -lc_threshold'); + } + $command .= ' -lc_threshold '.$params{lc_threshold}; + $complval = $params{lc_threshold}; +} +if(exists $params{lc_threshold} && !exists $params{lc_method}) { + &printError('the low complexity threshold requires a method specified by -lc_method'); +} + +#check for quality trimming +my $trimscore; +if(exists $params{trim_qual_left} || exists $params{trim_qual_right}) { + $command .= ' -trim_qual_right '.$params{trim_qual_right} if(exists $params{trim_qual_right}); + $command .= ' -trim_qual_left '.$params{trim_qual_left} if(exists $params{trim_qual_left}); + if(exists $params{trim_qual_type}) { + unless($params{trim_qual_type} eq 'min' || $params{trim_qual_type} eq 'mean' || $params{trim_qual_type} eq 'max' || $params{trim_qual_type} eq 'sum') { + &printError('invalid value for trim_qual_type'); + } + } else { + $params{trim_qual_type} = $TRIM_QUAL_TYPE; + } + $command .= ' -trim_qual_type '.$params{trim_qual_type}; + if(exists $params{trim_qual_rule}) { + unless($params{trim_qual_rule} eq 'lt' || $params{trim_qual_rule} eq 'gt' || $params{trim_qual_rule} eq 'et') { + &printError('invalid value for trim_qual_rule'); + } + } else { + $params{trim_qual_rule} = $TRIM_QUAL_RULE; + } + $command .= ' -trim_qual_rule '.$params{trim_qual_rule}; + unless(exists $params{trim_qual_window}) { + $params{trim_qual_window} = 
$TRIM_QUAL_WINDOW; + } + $command .= ' -trim_qual_window '.$params{trim_qual_window}; + unless(exists $params{trim_qual_step}) { + $params{trim_qual_step} = $TRIM_QUAL_STEP; + } + $command .= ' -trim_qual_step '.$params{trim_qual_step}; + $trimscore = 1; +} + +#check sequence case +if(exists $params{seq_case}) { + $command .= ' -seq_case '.$params{seq_case}; + unless($params{seq_case} eq 'upper' || $params{seq_case} eq 'lower') { + &printError('invalid sequence case option'); + } +} + +#check for dna/rna +if(exists $params{dna_rna}) { + $command .= ' -dna_rna '.$params{dna_rna}; + unless($params{dna_rna} eq 'dna' || $params{dna_rna} eq 'rna') { + &printError('invalid option for -dna_rna'); + } + if($aa) { + &printError('option -dna_rna cannot be used with option -aa'); + } +} + +#set remaining parameters +my $linelen; +if($params{out_format} == 3) { + $linelen = 0; +} elsif(exists $params{line_width}) { + $linelen = $params{line_width}; + $command .= ' -line_width '.$params{line_width}; +} else { + $linelen = $LINE_WIDTH; +} + +if(exists $params{seq_id}) { + $command .= ' -seq_id '.$params{seq_id}; + #remove spaces, ">" and quotes from sequence ids + $params{seq_id} =~ s/[\s\>\"\'\`]//g; +} elsif(exists $params{seq_id_mappings}) { + &printError('option -seq_id_mappings requires option -seq_id'); +} +if(exists $params{seq_id_mappings}) { + $command .= ' -seq_id_mappings'.($params{seq_id_mappings} ? 
' '.$params{seq_id_mappings} : ''); +} + +my ($repAleft,$repTleft,$repAright,$repTright,$repNleft,$repNright); +if(exists $params{trim_tail_left}) { + $command .= ' -trim_tail_left '.$params{trim_tail_left}; + $repAleft = 'A'x$params{trim_tail_left}; + $repAleft = qr/^$repAleft/; + $repTleft = 'T'x$params{trim_tail_left}; + $repTleft = qr/^$repTleft/; +} +if(exists $params{trim_tail_right}) { + $command .= ' -trim_tail_right '.$params{trim_tail_right}; + $repAright = 'A'x$params{trim_tail_right}; + $repAright = qr/$repAright$/; + $repTright = 'T'x$params{trim_tail_right}; + $repTright = qr/$repTright$/; +} +if(exists $params{trim_ns_left}) { + $command .= ' -trim_ns_left '.$params{trim_ns_left}; + $repNleft = 'N'x$params{trim_ns_left}; + $repNleft = qr/^$repNleft/; +} +if(exists $params{trim_ns_right}) { + $command .= ' -trim_ns_right '.$params{trim_ns_right}; + $repNright = 'N'x$params{trim_ns_right}; + $repNright = qr/$repNright$/; +} + +#graph data file +if(exists $params{graph_data}) { + if(exists $params{stats}) { + &printError("The graph data cannot be generated at the same time as the statistics"); + } + $command .= ' -graph_data'.($params{graph_data} ? ' '.$params{graph_data} : ''); + unless($params{graph_data}) { + $params{graph_data} = join("__",$file1||'nonamegiven').'.gd'; + } + $params{graph_data} = cwd().'/'.$params{graph_data} unless($params{graph_data} =~ /^\//); +} +my $scale = 1; +if(exists $params{qual_noscale}) { + $command .= ' -qual_noscale'; + $scale = 0; +} +#graph data selection +if(exists $params{graph_stats} && !exists $params{graph_data}) { + &printError('option -graph_stats requires option -graph_data'); +} +my %webstats = %GRAPH_OPTIONS; +my %graphstats = %GRAPH_OPTIONS; +if(exists $params{graph_stats}) { + $command .= ' -graph_stats'.($params{graph_stats} ? 
' '.$params{graph_stats} : ''); + if($params{graph_stats}) { + #set all zeroto reset default selection + foreach my $s (keys %graphstats) { + $graphstats{$s} = 0; + $webstats{$s} = 0; + } + my @tmp = split(',',$params{graph_stats}); + foreach my $s (@tmp) { + if(exists $graphstats{$s}) { + $graphstats{$s} = 1; + } else { + &printError('unknown option "'.$s.'" for -graph_stats'); + } + } + } else { + &printError('please specify at least one option for -graph_stats'); + } +} +if(exists $params{graph_stats} && exists $params{web}) { + &printError('option -graph_stats cannot be used in combination with -web'); +} +#web output +my $webnoprocess = 0; +if(exists $params{web}) { + $command .= ' -web'.($params{web} ? ' '.$params{web} : ''); + if($params{web}) { + unless($params{web} eq 'process') { + $webnoprocess = 1; + foreach my $s (keys %webstats) { + $webstats{$s} = 0; + $graphstats{$s} = 0; + } + my @tmp = split(',',$params{web}); + foreach my $s (@tmp) { + $webstats{$s} = 1; + } + } + } +} +if(exists $params{graph_stats} && $graphstats{da}) { + if($exactonly) { + &printError('"-exact_only" and "-graph_stats da" cannot be specified at the same time'); + } else { + $graphstats{de} = 0; + } +} +#do not calculate all duplicates for paired-end data (at least for now) +if($file2) { + if(exists $webstats{da}) { + $webstats{da} = 0; + } + if(exists $graphstats{da}) { + $graphstats{da} = 0; + } +} +if($webstats{da}) { + $webstats{de} = 0; +} +if($graphstats{da}) { + $graphstats{de} = 0; +} +if((exists $params{graph_data} && $graphstats{de}) || (exists $params{web} && $webstats{de})) { + $exactonly = 1; +} + +#custom params +my @cps = (); +if(exists $params{custom_params}) { + $command .= ' -custom_params "'.$params{custom_params}.'"'; + my ($repeats,@tmp,$bases); + foreach my $rule (split(/\s*\;\s*/,$params{custom_params})) { + $repeats = 1; + @tmp = split(/\s+/,$rule); + next unless(scalar(@tmp) == 2); + $bases = ($tmp[0] =~ tr/ACGTN//); + next if($bases < length($tmp[0])); 
+ if(index($tmp[1],'%') != -1) { + $tmp[1] =~ s/\%//g; + $repeats = 0; + } + next unless($tmp[1] =~ m/^\d+$/o); + push(@cps,[$repeats,$tmp[0],$tmp[1]]); + } +} + +#add remaining to log command +if(exists $params{log} || exists $params{graph_data}) { + if(exists $params{log}) { + $command .= ' -log'.($params{log} ? ' '.$params{log} : ''); + } + if(exists $params{min_len}) { + $command .= ' -min_len '.$params{min_len}; + } + if(exists $params{max_len}) { + $command .= ' -max_len '.$params{max_len}; + } + if(exists $params{range_len}) { + $command .= ' -range_len '.$params{range_len}; + } + if(exists $params{min_gc}) { + $command .= ' -min_gc '.$params{min_gc}; + } + if(exists $params{max_gc}) { + $command .= ' -max_gc '.$params{max_gc}; + } + if(exists $params{range_gc}) { + $command .= ' -range_gc '.$params{range_gc}; + } + if(exists $params{min_qual_score}) { + $command .= ' -min_qual_score '.$params{min_qual_score}; + } + if(exists $params{max_qual_score}) { + $command .= ' -max_qual_score '.$params{max_qual_score}; + } + if(exists $params{min_qual_mean}) { + $command .= ' -min_qual_mean '.$params{min_qual_mean}; + } + if(exists $params{max_qual_mean}) { + $command .= ' -max_qual_mean '.$params{max_qual_mean}; + } + if(exists $params{ns_max_p}) { + $command .= ' -ns_max_p '.$params{ns_max_p}; + } + if(exists $params{ns_max_n}) { + $command .= ' -ns_max_n '.$params{ns_max_n}; + } + if(exists $params{noniupac}) { + $command .= ' -noniupac'; + } + if(exists $params{seq_num}) { + $command .= ' -seq_num '.$params{seq_num}; + } + if(exists $params{trim_to_len}) { + $command .= ' -trim_to_len '.$params{trim_to_len}; + } + if(exists $params{trim_left}) { + $command .= ' -trim_left '.$params{trim_left}; + } + if(exists $params{trim_right}) { + $command .= ' -trim_right '.$params{trim_right}; + } + if(exists $params{trim_left_p}) { + $command .= ' -trim_left_p '.$params{trim_left_p}; + } + if(exists $params{trim_right_p}) { + $command .= ' -trim_right_p 
'.$params{trim_right_p}; + } + if(exists $params{rm_header}) { + $command .= ' -rm_header'; + } + if(exists $params{stats_len}) { + $command .= ' -stats_len'; + } + if(exists $params{stats_dinuc}) { + $command .= ' -stats_dinuc'; + } + if(exists $params{stats_info}) { + $command .= ' -stats_info'; + } + if(exists $params{stats_tag}) { + $command .= ' -stats_tag'; + } + if(exists $params{stats_dupl}) { + $command .= ' -stats_dupl'; + } + if(exists $params{stats_ns}) { + $command .= ' -stats_ns'; + } + if(exists $params{stats_assembly}) { + $command .= ' -stats_assembly'; + } + if(exists $params{verbose}) { + $command .= ' -verbose'; + } + if(exists $params{out_good}) { + $command .= ' -out_good '.$params{out_good}; + } + if(exists $params{out_bad}) { + $command .= ' -out_bad '.$params{out_bad}; + } + if(exists $params{no_qual_header}) { + $command .= ' -no_qual_header'; + } + + if(exists $params{log}) { + unless($params{log}) { + $params{log} = join("__",$file1||'nonamegiven').'.log'; + } + $params{log} = cwd().'/'.$params{log} unless($params{log} =~ /^\//); + if(exists $params{web}) { + &printLog("Executing PRINSEQ using params file"); + } else { + &printLog("Executing PRINSEQ with command: \"perl prinseq-".$WHAT.".pl".$command."\""); + } + } +} + +# +################################################################################ +## DATA PROCESSING +################################################################################ +# + +#order of processing: +#seq_num, trim_left, trim_right, trim_left_p, trim_right_p, trim_qual_left, trim_qual_right, trim_tail_left, trim_tail_right, trim_ns_left, trim_ns_right, trim_to_len, min_len, max_len, range_len, min_qual_score, max_qual_score, min_qual_mean, max_qual_mean, min_gc, max_gc, range_gc, ns_max_p, ns_max_n, noniupac, lc_method, derep, seq_id, seq_case, dna_rna, out_format + +my $filename = $file1; +while($filename =~ /[\w\d]+\.[\w\d]+$/) { + $filename =~ s/\.[\w\d]+$//; + last if($filename =~ /\/[^\.]+$/); +} +my 
$filename2; +if($file2) { + $filename2 = $file2; + while($filename2 =~ /[\w\d]+\.[\w\d]+$/) { + $filename2 =~ s/\.[\w\d]+$//; + last if($filename2 =~ /\/[^\.]+$/); + } +} + +#create filehandles for the output data +my ($fhgood,$fhgood2,$fhgood3,$fh2good,$fh2good2,$fhbad,$fhbad2,$fhbad3,$fh2bad,$fhmappings); +my ($filenamegood,$filenamegood2,$filenamegood3,$filename2good,$filename2good2,$filenamebad,$filenamebad2,$filenamebad3,$filename2bad,$filenamemappings); +my ($nogood,$nobad,$stdoutgood,$stdoutbad,$mappings); +$nogood = $nobad = $stdoutgood = $stdoutbad = $mappings = 0; +if(exists $params{out_good}) { + if($params{out_good} eq 'null') { + $nogood = 1; + } elsif($params{out_good} eq 'stdout') { + $stdoutgood = 1; + } else { + if($filename2) { + #first input file outputs + open($fhgood,">".$params{out_good}.'_1.fast'.($params{out_format} == 3 ? 'q' : 'a')) or &printError('cannot open output file'); + $filenamegood = $params{out_good}.'_1.fast'.($params{out_format} == 3 ? 'q' : 'a'); + open($fhgood2,">".$params{out_good}.'_1_singletons.fast'.($params{out_format} == 3 ? 'q' : 'a')) or &printError('cannot open output file'); + $filenamegood2 = $params{out_good}.'_1_singletons.fast'.($params{out_format} == 3 ? 'q' : 'a'); + #second input file outputs + open($fh2good,">".$params{out_good}.'_2.fast'.($params{out_format} == 3 ? 'q' : 'a')) or &printError('cannot open output file'); + $filename2good = $params{out_good}.'_2.fast'.($params{out_format} == 3 ? 'q' : 'a'); + open($fh2good2,">".$params{out_good}.'_2_singletons.fast'.($params{out_format} == 3 ? 'q' : 'a')) or &printError('cannot open output file'); + $filename2good2 = $params{out_good}.'_2_singletons.fast'.($params{out_format} == 3 ? 'q' : 'a'); + } else { + open($fhgood,">".$params{out_good}.'.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 
'q' : 'a')) or &printError('cannot open output file'); + $filenamegood = $params{out_good}.'.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 'q' : 'a'); + } + } +} else { + if($filename2) { + $fhgood = File::Temp->new( TEMPLATE => $filename.'_prinseq_good_XXXX', + SUFFIX => '.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 'q' : 'a'), + UNLINK => 0); + $filenamegood = $fhgood->filename; + $fhgood2 = File::Temp->new( TEMPLATE => $filename.'_prinseq_good_singletons_XXXX', + SUFFIX => '.fast'.($params{out_format} == 3 ? 'q' : 'a'), + UNLINK => 0); + $filenamegood2 = $fhgood2->filename; + $fh2good = File::Temp->new( TEMPLATE => $filename2.'_prinseq_good_XXXX', + SUFFIX => '.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 'q' : 'a'), + UNLINK => 0); + $filename2good = $fh2good->filename; + $fh2good2 = File::Temp->new( TEMPLATE => $filename2.'_prinseq_good_singletons_XXXX', + SUFFIX => '.fast'.($params{out_format} == 3 ? 'q' : 'a'), + UNLINK => 0); + $filename2good2 = $fh2good2->filename; + } else { + $fhgood = File::Temp->new( TEMPLATE => $filename.'_prinseq_good_XXXX', + SUFFIX => '.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 'q' : 'a'), + UNLINK => 0); + $filenamegood = $fhgood->filename; + } +} +if(exists $params{out_bad}) { + if($params{out_bad} eq 'null') { + $nobad = 1; + } elsif($params{out_bad} eq 'stdout') { + $stdoutbad = 1; + } else { + if($filename2) { + open($fhbad,">".$params{out_bad}.'_1.fast'.($params{out_format} == 3 ? 'q' : 'a')) or &printError('cannot open output file'); + $filenamebad = $params{out_bad}.'_1.fast'.($params{out_format} == 3 ? 'q' : 'a'); + open($fh2bad,">".$params{out_bad}.'_2.fast'.($params{out_format} == 3 ? 'q' : 'a')) or &printError('cannot open output file'); + $filename2bad = $params{out_bad}.'_2.fast'.($params{out_format} == 3 ? 
'q' : 'a'); + } else { + open($fhbad,">".$params{out_bad}.'.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 'q' : 'a')) or &printError('cannot open output file'); + $filenamebad = $params{out_bad}.'.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 'q' : 'a'); + } + } +} else { + $fhbad = File::Temp->new( TEMPLATE => $filename.'_prinseq_bad_XXXX', + SUFFIX => '.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 'q' : 'a'), + UNLINK => 0); + $filenamebad = $fhbad->filename; + if($filename2) { + $fh2bad = File::Temp->new( TEMPLATE => $filename2.'_prinseq_bad_XXXX', + SUFFIX => '.fast'.(($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5 ) ? 'q' : 'a'), + UNLINK => 0); + $filename2bad = $fh2bad->filename; + } +} +if($params{out_format} == 2 || $params{out_format} == 5) { + if(exists $params{out_good}) { + unless($nogood) { + open($fhgood2,">".$params{out_good}.'.qual') or &printError('cannot open output file'); + $filenamegood2 = $params{out_good}.'.qual'; + } + } else { + $fhgood2 = File::Temp->new( TEMPLATE => $filename.'_prinseq_good_XXXX', + SUFFIX => '.qual', + UNLINK => 0); + $filenamegood2 = $fhgood2->filename; + } + if(exists $params{out_bad}) { + unless($nobad) { + open($fhbad2,">".$params{out_bad}.'.qual') or &printError('cannot open output file'); + $filenamebad2 = $params{out_bad}.'.qual'; + } + } else { + $fhbad2 = File::Temp->new( TEMPLATE => $filename.'_prinseq_bad_XXXX', + SUFFIX => '.qual', + UNLINK => 0); + $filenamebad2 = $fhbad2->filename; + } +} +if($params{out_format} == 4 || $params{out_format} == 5) { + if(exists $params{out_good}) { + unless($nogood) { + open($fhgood3,">".$params{out_good}.'.fasta') or &printError('cannot open output file'); + $filenamegood3 = $params{out_good}.'.fasta'; + } + } else { + $fhgood3 = File::Temp->new( TEMPLATE => $filename.'_prinseq_good_XXXX', + SUFFIX => '.fasta', + UNLINK => 0); + $filenamegood3 = 
$fhgood3->filename; + } + if(exists $params{out_bad}) { + unless($nobad) { + open($fhbad3,">".$params{out_bad}.'.fasta') or &printError('cannot open output file'); + $filenamebad3 = $params{out_bad}.'.fasta'; + } + } else { + $fhbad3 = File::Temp->new( TEMPLATE => $filename.'_prinseq_bad_XXXX', + SUFFIX => '.fasta', + UNLINK => 0); + $filenamebad3 = $fhbad3->filename; + } +} +if(exists $params{seq_id_mappings}) { + $mappings = 1; + if($params{seq_id_mappings}) { + open($fhmappings,">".$params{seq_id_mappings}) or &printError('cannot open output file'); + $filenamemappings = $params{seq_id_mappings}; + } else { + open($fhmappings,">".$filename.'_prinseq_good.ids') or &printError('cannot open output file'); + $filenamemappings = $filename.'_prinseq_good.ids'; + } +} + +my $numlines = 0; +$webnoprocess = 1 if(!$webnoprocess && exists $params{graph_data} && !exists $params{web} && $nogood && $nobad); +my ($progress,$counter,$part); +$progress = 0; +$counter = $part = 1; +if(exists $params{verbose}) { + print STDERR "Estimate size of input data for status report (this might take a while for large files)\n"; + $numlines = ($file1 eq 'stdin' ? 1 : &getLineNumber($file1)); + $numlines += &getLineNumber($file2) if($file2); + print STDERR "\tdone\n"; +} +if(exists $params{web}) { + &printWeb("STATUS: Estimate size of input data for status report (this might take a while for large files)"); + $numlines = &getLineNumber($file1) unless($numlines); + $numlines += &getLineNumber($file2) if($file2) +} +#for progress bar +if($numlines) { + $part = int($numlines/100); +} + +#parse input data +print STDERR "Parse and process input data\n" if(exists $params{verbose}); +print STDERR "\r\tstatus: ".int($progress)." 
\%" if(exists $params{verbose}); +if(exists $params{web}) { + &printWeb("STATUS: Parsing and processing input data"); + &printWeb("STATUS: process-status $progress"); + &printLog("Parsing and processing input data"); +} else { + &printLog("Parsing and processing input data: \"".$file1."\"".(exists $params{qual} ? " and \"".$params{qual}."\"" : "").($file2 ? " and \"".$file2."\"" : "")); +} +my $numseqs = 0; +my $numseqs2 = 0; +my $pairs = 0; +my $goodcount = 0; +my $badcount = 0; +my $badcount2 = 0; +my ($tsvfile,$seqid,$header,$seq,$qual,$count,$numbases,$numbases2,$length,$seqgd); +$count = 0; +$qual = ''; +$seq = ''; +#stats data +my (%stats,%kmers,%odds,%counts,%graphdata,%kmers2,%counts2,%graphdata2,$md5,$md5r,%md5s,%md5sg,$ucseq,$maxlength); +$maxlength = 0; +#parse data +my $seqcount = 0; +my $seqcount1 = 0; #singleton file 1 +my $seqcount2 = 0; #singleton file 2 +my $seqbases = 0; +my $seqbases1 = 0; #singleton file 1 +my $seqbases2 = 0; #singleton file 2 +my $badbases = 0; +my $badbases2 = 0; +my (@seqs,@seqsP,@printtmp,$line); #seqs for stats and graph data, seqsP for preprocessing sequences - all used for duplicate checking/removal + +if($file2) { + open(FILE,"perl -pe 's/\r\n|\r/\n/g' < $file1 |") or &printError("Could not open file $file1: $!"); + open(FILE2,"perl -pe 's/\r\n|\r/\n/g' < $file2 |") or &printError("Could not open file $file2: $!"); +} else { + if($file1 eq 'stdin') { + *FILE = *STDIN; + } else { + open(FILE,"perl -pe 's/\r\n|\r/\n/g' < $file1 |") or &printError("Could not open file $file1: $!"); + } + if(exists $params{qual}) { + if($params{qual} eq 'stdin') { + &printError('QUAL data cannot be read from STDIN'); + } else { + open(FILE2,"perl -pe 's/\r\n|\r/\n/g' < ".$params{qual}." |") or &printError("Could not open file ".$params{qual}.": $!"); + } + } +} + +my $exists_stats = (exists $params{stats} ? 1 : 0); +my $exists_graphdata = (exists $params{graph_data} ? 
1 : 0); + +if($file2) { + my ($seqid2,$seq2,$header2,$qual2,$length2,%tmpids,@tmpdata1,@tmpdata2,$tmpindex,$tmpentry,$skip1,$skip2,$tmpid,$tmpid2); + $skip1 = $skip2 = 0; + if(exists $params{fastq}) { + while(1) { + ($seq,$seqid,$qual,$header,$tmpid) = &readEntryFastq(*FILE,$skip1--,$trimnum1); + ($seq2,$seqid2,$qual2,$header2,$tmpid2) = &readEntryFastq(*FILE2,$skip2--,$trimnum2); + last unless($seq || $seq2); + #check if both have same id; if not, store in tmpdata + if(defined $tmpid && defined $tmpid2 && $tmpid eq $tmpid2) { #same ids + &processEntryPairedEnd(length($seq),$seq,$seqid,$header,$qual,length($seq2),$seq2,$seqid2,$header2,$qual2); + if(keys %tmpids) { #empty tmpdata + %tmpids = (); + while(@tmpdata1) { + &processEntryPairedEnd(@{(shift(@tmpdata1))}[0..4]); + } + while(@tmpdata2) { + &processEntryPairedEnd(undef,undef,undef,undef,undef,@{(shift(@tmpdata2))}[0..4]); + } + } + $skip1 = $skip2 = 0; + } elsif(defined $tmpid && exists $tmpids{$tmpid}) { + while(@tmpdata1) { + $tmpentry = shift(@tmpdata1); + &processEntryPairedEnd(@$tmpentry[0..4]); + delete($tmpids{$tmpentry->[5]}); + } + $tmpindex = $tmpids{$tmpid}; + while($tmpindex-- > 0) { + $tmpentry = shift(@tmpdata2); + &processEntryPairedEnd(undef,undef,undef,undef,undef,@$tmpentry[0..4]); + delete($tmpids{$tmpentry->[5]}); + } + #correct indices + $tmpindex = $tmpids{$tmpid}+1; + foreach my $k (keys %tmpids) { + $tmpids{$k} -= $tmpindex; + } + &processEntryPairedEnd(length($seq),$seq,$seqid,$header,$qual,@{(shift(@tmpdata2))}[0..4]); + delete($tmpids{$tmpid}); + if(defined $seqid2) { + $tmpids{$tmpid2} = scalar(@tmpdata2); + push(@tmpdata2,[length($seq2),$seq2,$seqid2,$header2,$qual2,$tmpid2]); + } + #equal out tmpdata1 and tmpdata2 + unless(scalar(@tmpdata1) == scalar(@tmpdata2)) { + $skip1 = 0; + $skip2 = scalar(@tmpdata2); + } + } elsif(defined $seqid2 && exists $tmpids{$tmpid2}) { + while(@tmpdata2) { + $tmpentry = shift(@tmpdata2); + 
&processEntryPairedEnd(undef,undef,undef,undef,undef,@$tmpentry[0..4]); + delete($tmpids{$tmpentry->[5]}); + } + $tmpindex = $tmpids{$tmpid2}; + while($tmpindex-- > 0) { + $tmpentry = shift(@tmpdata1); + &processEntryPairedEnd(@$tmpentry[0..4]); + delete($tmpids{$tmpentry->[5]}); + } + #correct indices + $tmpindex = $tmpids{$tmpid2}+1; + foreach my $k (keys %tmpids) { + $tmpids{$k} -= $tmpindex; + } + &processEntryPairedEnd(@{(shift(@tmpdata1))}[0..4],length($seq2),$seq2,$seqid2,$header2,$qual2); + delete($tmpids{$tmpid2}); + if(defined $seqid) { + $tmpids{$tmpid} = scalar(@tmpdata1); + push(@tmpdata1,[length($seq),$seq,$seqid,$header,$qual,$tmpid]); + } + #equal out tmpdata1 and tmpdata2 + unless(scalar(@tmpdata1) == scalar(@tmpdata2)) { + $skip1 = scalar(@tmpdata1); + $skip2 = 0; + } + } else { #store in tmpdata + if(defined $seqid) { + $tmpids{$tmpid} = scalar(@tmpdata1); + push(@tmpdata1,[length($seq),$seq,$seqid,$header,$qual,$tmpid]); + } + if(defined $seqid2) { + $tmpids{$tmpid2} = scalar(@tmpdata2); + push(@tmpdata2,[length($seq2),$seq2,$seqid2,$header2,$qual2,$tmpid2]); + } + } + #progress bar stuff + if($counter > $part) { + $counter = 1; + $progress++; + $progress = 99 if($progress > 99); + print STDERR "\r\tstatus: ".int($progress)." 
\%" if(exists $params{verbose}); + &printWeb("STATUS: process-status $progress"); + } + } + #process remaining in tmpdata + while(@tmpdata1) { + $tmpentry = shift(@tmpdata1); + &processEntryPairedEnd(@$tmpentry[0..4]); + delete($tmpids{$tmpentry->[5]}); + } + while(@tmpdata2) { + $tmpentry = shift(@tmpdata2); + &processEntryPairedEnd(undef,undef,undef,undef,undef,@$tmpentry[0..4]); + delete($tmpids{$tmpentry->[5]}); + } + } else { #format is FASTA + my ($nextid,$nextheader,$nextid2,$nextheader2); + while(1) { + ($seq,$seqid,$header,$tmpid,$nextid,$nextheader) = &readEntryFasta(*FILE,$skip1--,$trimnum1,$nextid,$nextheader); + ($seq2,$seqid2,$header2,$tmpid2,$nextid2,$nextheader2) = &readEntryFasta(*FILE2,$skip2--,$trimnum2,$nextid2,$nextheader2); + last unless($seq || $seq2); + #check if both have same id; if not, store in tmpdata + if(defined $tmpid && defined $tmpid2 && $tmpid eq $tmpid2) { #same ids + &processEntryPairedEnd(length($seq),$seq,$seqid,$header,undef,length($seq2),$seq2,$seqid2,$header2,undef); + if(keys %tmpids) { #empty tmpdata + %tmpids = (); + while(@tmpdata1) { + &processEntryPairedEnd(@{(shift(@tmpdata1))}[0..3]); + } + while(@tmpdata2) { + &processEntryPairedEnd(undef,undef,undef,undef,undef,@{(shift(@tmpdata2))}[0..3]); + } + } + $skip1 = $skip2 = 0; + } elsif(defined $tmpid && exists $tmpids{$tmpid}) { + while(@tmpdata1) { + $tmpentry = shift(@tmpdata1); + &processEntryPairedEnd(@$tmpentry[0..3]); + delete($tmpids{$tmpentry->[4]}); + } + $tmpindex = $tmpids{$tmpid}; + while($tmpindex-- > 0) { + $tmpentry = shift(@tmpdata2); + &processEntryPairedEnd(undef,undef,undef,undef,undef,@$tmpentry[0..3]); + delete($tmpids{$tmpentry->[4]}); + } + #correct indices + $tmpindex = $tmpids{$tmpid}+1; + foreach my $k (keys %tmpids) { + $tmpids{$k} -= $tmpindex; + } + &processEntryPairedEnd(length($seq),$seq,$seqid,$header,undef,@{(shift(@tmpdata2))}[0..3]); + delete($tmpids{$tmpid}); + if(defined $seqid2) { + $tmpids{$tmpid2} = scalar(@tmpdata2); + 
push(@tmpdata2,[length($seq2),$seq2,$seqid2,$header2,$tmpid2]); + } + #equal out tmpdata1 and tmpdata2 + unless(scalar(@tmpdata1) == scalar(@tmpdata2)) { + $skip1 = 0; + $skip2 = scalar(@tmpdata2); + } + } elsif(defined $seqid2 && exists $tmpids{$tmpid2}) { + while(@tmpdata2) { + $tmpentry = shift(@tmpdata2); + &processEntryPairedEnd(undef,undef,undef,undef,undef,@$tmpentry[0..3]); + delete($tmpids{$tmpentry->[4]}); + } + $tmpindex = $tmpids{$tmpid2}; + while($tmpindex-- > 0) { + $tmpentry = shift(@tmpdata1); + &processEntryPairedEnd(@$tmpentry[0..3]); + delete($tmpids{$tmpentry->[4]}); + } + #correct indices + $tmpindex = $tmpids{$tmpid2}+1; + foreach my $k (keys %tmpids) { + $tmpids{$k} -= $tmpindex; + } + &processEntryPairedEnd(@{(shift(@tmpdata1))}[0..3],undef,length($seq2),$seq2,$seqid2,$header2); + delete($tmpids{$tmpid2}); + if(defined $seqid) { + $tmpids{$tmpid} = scalar(@tmpdata1); + push(@tmpdata1,[length($seq),$seq,$seqid,$header,$tmpid]); + } + #equal out tmpdata1 and tmpdata2 + unless(scalar(@tmpdata1) == scalar(@tmpdata2)) { + $skip1 = scalar(@tmpdata1); + $skip2 = 0; + } + } else { #store in tmpdata + if(defined $seqid) { + $tmpids{$tmpid} = scalar(@tmpdata1); + push(@tmpdata1,[length($seq),$seq,$seqid,$header,$tmpid]); + } + if(defined $seqid2) { + $tmpids{$tmpid2} = scalar(@tmpdata2); + push(@tmpdata2,[length($seq2),$seq2,$seqid2,$header2,$tmpid2]); + } + } + #progress bar stuff + if($counter > $part) { + $counter = 1; + $progress++; + $progress = 99 if($progress > 99); + print STDERR "\r\tstatus: ".int($progress)." 
\%" if(exists $params{verbose}); + &printWeb("STATUS: process-status $progress"); + } + } + #process remaining in tmpdata + while(@tmpdata1) { + $tmpentry = shift(@tmpdata1); + &processEntryPairedEnd(@$tmpentry[0..3]); + delete($tmpids{$tmpentry->[4]}); + } + while(@tmpdata2) { + $tmpentry = shift(@tmpdata2); + &processEntryPairedEnd(undef,undef,undef,undef,undef,@$tmpentry[0..3]); + delete($tmpids{$tmpentry->[4]}); + } + } +} else { + if(exists $params{fastq}) { + foreach(@dataread) { #only used for stdin inputs + chomp(); + if($count == 0 && /^\@(\S+)\s*(.*)$/o) { + $length = length($seq); + if($length) { + &processEntry($length,$seq,$seqid,$qual,$header); + } + $seqid = $1; + $header = $2 || ''; + $seq = ''; + $qual = ''; + } elsif($count == 1) { + $seq = $_; + } elsif($count == 3) { + $qual = $_; + $count = -1; + } + $count++; + } + while(<FILE>) { + chomp(); + if($count == 0 && /^\@(\S+)\s*(.*)$/o) { + $length = length($seq); + if($length) { + &processEntry($length,$seq,$seqid,$qual,$header); + } + $seqid = $1; + $header = $2 || ''; + $seq = ''; + $qual = ''; + } elsif($count == 1) { + $seq = $_; + } elsif($count == 3) { + $qual = $_; + $count = -1; + } + $count++; + #progress bar stuff + $counter++; + if($counter > $part) { + $counter = 1; + $progress++; + $progress = 99 if($progress > 99); + print STDERR "\r\tstatus: ".int($progress)." \%" if(exists $params{verbose}); + &printWeb("STATUS: process-status $progress"); + } + } + #add last one + $length = length($seq); + if($length) { + &processEntry($length,$seq,$seqid,$qual,$header); + } + } elsif(exists $params{fasta}) { + foreach(@dataread) { #only used for stdin inputs + chomp(); + if(/^>(\S+)\s*(.*)$/o) { + $length = length($seq); + if($length) { + #get qual data if provided + if(exists $params{qual}) { + while(<FILE2>) { + chomp(); + last if(/^>/ && $qual); + next if(/^>/); + $qual .= $_.' 
'; + } + $qual = &convertQualNumsToAsciiString($qual); + } + &processEntry($length,$seq,$seqid,$qual,$header); + $qual = ''; + } + $seqid = $1; + $header = $2 || ''; + $seq = ''; + } else { + $seq .= $_; + } + } + while(<FILE>) { + chomp(); + if(/^>(\S+)\s*(.*)$/o) { + $length = length($seq); + if($length) { + #get qual data if provided + if(exists $params{qual}) { + while(<FILE2>) { + chomp(); + last if(/^>/ && $qual); + next if(/^>/); + $qual .= $_.' '; + } + $qual = &convertQualNumsToAsciiString($qual); + } + &processEntry($length,$seq,$seqid,$qual,$header); + $qual = ''; + } + $seqid = $1; + $header = $2 || ''; + $seq = ''; + } else { + $seq .= $_; + } + #progress bar stuff + $counter++; + if($counter > $part) { + $counter = 1; + $progress++; + $progress = 99 if($progress > 99); + print STDERR "\r\tstatus: ".int($progress)." \%" if(exists $params{verbose}); + &printWeb("STATUS: process-status $progress"); + } + } + #add last one + $length = length($seq); + if($length) { + #get qual data if provided + if(exists $params{qual}) { + while(<FILE2>) { + chomp(); + last if(/^>/ && $qual); + next if(/^>/); + $qual .= $_.' 
'; + } + $qual = &convertQualNumsToAsciiString($qual); + } + &processEntry($length,$seq,$seqid,$qual,$header); + $qual = ''; + } + } +} + +print STDERR "\r\tdone \n" if(exists $params{verbose}); +&printWeb("STATUS: process-status 100"); +&printWeb("Done parsing and processing input data"); +&printLog("Done parsing and processing input data"); + +if($derep && !exists $params{stats} && !$exactonly && !$webnoprocess) { + &printLog("Remove duplicates"); + &derepSeqs(); + &printLog("Done removing duplicates"); +} + +close(FILE); +close(FILE2) if(exists $params{qual} || $file2); + +print STDERR "Clean up empty files\n" if(exists $params{verbose} && !exists $params{stats}); +#close filehandles +unless($nogood || $stdoutgood) { + close($fhgood); + if($file2) { + close($fhgood2); + close($fh2good); + close($fh2good2); + } +} +unless($nobad || $stdoutbad) { + close($fhbad); + if($file2) { + close($fh2bad); + } +} +if($params{out_format} == 2 || $params{out_format} == 5) { + close($fhgood2) unless($nogood || $stdoutgood); + close($fhbad2) unless($nobad || $stdoutbad); +} +if($params{out_format} == 4 || $params{out_format} == 5) { + close($fhgood3) unless($nogood || $stdoutgood); + close($fhbad3) unless($nobad || $stdoutbad); +} +if(exists $params{seq_id_mappings}) { + close($fhmappings); +} +#remove empty files +if($seqcount == 0 && !$nogood && !$stdoutgood) { + unlink($filenamegood); + if($params{out_format} == 2 || $params{out_format} == 5) { + unlink($filenamegood2); + } + if($params{out_format} == 4 || $params{out_format} == 5) { + unlink($filenamegood3); + } + unlink($filenamemappings) if(defined $filenamemappings); +} +if($badcount == 0 && !$nobad && !$stdoutbad) { + unlink($filenamebad); + if($params{out_format} == 2 || $params{out_format} == 5) { + unlink($filenamebad2); + } + if($params{out_format} == 4 || $params{out_format} == 5) { + unlink($filenamebad3); + } +} +if($file2) { + foreach my $f ($filenamegood2,$filename2good,$filename2good2,$filename2bad) { + if(defined $f && -e $f && 
&getLineNumber($f) == 0) { + unlink($f); + } + } +} +print STDERR "\tdone\n" if(exists $params{verbose} && !exists $params{stats}); + +#print processing stats +if((exists $params{verbose} || !$webnoprocess) && !exists $params{stats}) { + print STDERR "Input and filter stats:\n"; + if($file2) { + print STDERR "\tInput sequences (file 1): ".&addCommas($numseqs)."\n"; + print STDERR "\tInput bases (file 1): ".&addCommas($numbases)."\n"; + print STDERR "\tInput mean length (file 1): ".sprintf("%.2f",$numbases/$numseqs)."\n" if($numseqs); + print STDERR "\tInput sequences (file 2): ".&addCommas($numseqs2)."\n"; + print STDERR "\tInput bases (file 2): ".&addCommas($numbases2)."\n"; + print STDERR "\tInput mean length (file 2): ".sprintf("%.2f",$numbases2/$numseqs2)."\n" if($numseqs2); + print STDERR "\tGood sequences (pairs): ".&addCommas($seqcount)."\n" if($numseqs); + print STDERR "\tGood bases (pairs): ".&addCommas($seqbases)."\n" if($seqcount); + print STDERR "\tGood mean length (pairs): ".sprintf("%.2f",$seqbases/$seqcount)."\n" if($seqcount); + print STDERR "\tGood sequences (singletons file 1): ".&addCommas($seqcount1)." (".sprintf("%.2f",(100*$seqcount1/$numseqs))."%)\n" if($numseqs); + print STDERR "\tGood bases (singletons file 1): ".&addCommas($seqbases1)."\n" if($seqcount1); + print STDERR "\tGood mean length (singletons file 1): ".sprintf("%.2f",$seqbases1/$seqcount1)."\n" if($seqcount1); + print STDERR "\tGood sequences (singletons file 2): ".&addCommas($seqcount2)." (".sprintf("%.2f",(100*$seqcount2/$numseqs2))."%)\n" if($numseqs2); + print STDERR "\tGood bases (singletons file 2): ".&addCommas($seqbases2)."\n" if($seqcount2); + print STDERR "\tGood mean length (singletons file 2): ".sprintf("%.2f",$seqbases2/$seqcount2)."\n" if($seqcount2); + print STDERR "\tBad sequences (file 1): ".&addCommas($badcount)." 
(".sprintf("%.2f",(100*$badcount/$numseqs))."%)\n" if($numseqs); + print STDERR "\tBad bases (file 1): ".&addCommas($badbases)."\n" if($badcount); + print STDERR "\tBad mean length (file 1): ".sprintf("%.2f",$badbases/$badcount)."\n" if($badcount); + print STDERR "\tBad sequences (file 2): ".&addCommas($badcount2)." (".sprintf("%.2f",(100*$badcount2/$numseqs2))."%)\n" if($numseqs2); + print STDERR "\tBad bases (file 2): ".&addCommas($badbases2)."\n" if($badcount2); + print STDERR "\tBad mean length (file 2): ".sprintf("%.2f",$badbases2/$badcount2)."\n" if($badcount2); + } else { + print STDERR "\tInput sequences: ".&addCommas($numseqs)."\n"; + print STDERR "\tInput bases: ".&addCommas($numbases)."\n"; + print STDERR "\tInput mean length: ".sprintf("%.2f",$numbases/$numseqs)."\n" if($numseqs); + print STDERR "\tGood sequences: ".&addCommas($seqcount)." (".sprintf("%.2f",(100*$seqcount/$numseqs))."%)\n" if($numseqs); + print STDERR "\tGood bases: ".&addCommas($seqbases)."\n" if($seqcount); + print STDERR "\tGood mean length: ".sprintf("%.2f",$seqbases/$seqcount)."\n" if($seqcount); + print STDERR "\tBad sequences: ".&addCommas($badcount)." 
(".sprintf("%.2f",(100*$badcount/$numseqs))."%)\n" if($numseqs); + print STDERR "\tBad bases: ".&addCommas($badbases)."\n" if($badcount); + print STDERR "\tBad mean length: ".sprintf("%.2f",$badbases/$badcount)."\n" if($badcount); + } + my $tmp = &getFiltercounts(); + if($tmp) { + print STDERR "\tSequences filtered by specified parameters:\n"; + if(scalar(@$tmp)) { + print STDERR "\t$_\n" foreach(@$tmp); + } else { + print STDERR "\tnone\n"; + } + } + if(exists $params{log}) { + if($file2) { + &printLog("Input sequences (file 1): ".&addCommas($numseqs)); + &printLog("Input bases (file 1): ".&addCommas($numbases)); + &printLog("Input mean length (file 1): ".sprintf("%.2f",$numbases/$numseqs)) if($numseqs); + &printLog("Input sequences (file 2): ".&addCommas($numseqs2)); + &printLog("Input bases (file 2): ".&addCommas($numbases2)); + &printLog("Input mean length (file 2): ".sprintf("%.2f",$numbases2/$numseqs2)) if($numseqs2); + &printLog("Good sequences (pairs): ".&addCommas($seqcount)) if($numseqs); + &printLog("Good bases (pairs): ".&addCommas($seqbases)) if($seqcount); + &printLog("Good mean length (pairs): ".sprintf("%.2f",$seqbases/$seqcount)) if($seqcount); + &printLog("Good sequences (singletons file 1): ".&addCommas($seqcount1)." (".sprintf("%.2f",(100*$seqcount1/$numseqs))."%)") if($numseqs); + &printLog("Good bases (singletons file 1): ".&addCommas($seqbases1)) if($seqcount1); + &printLog("Good mean length (singletons file 1): ".sprintf("%.2f",$seqbases1/$seqcount1)) if($seqcount1); + &printLog("Good sequences (singletons file 2): ".&addCommas($seqcount2)." (".sprintf("%.2f",(100*$seqcount2/$numseqs2))."%)") if($numseqs2); + &printLog("Good bases (singletons file 2): ".&addCommas($seqbases2)) if($seqcount2); + &printLog("Good mean length (singletons file 2): ".sprintf("%.2f",$seqbases2/$seqcount2)) if($seqcount2); + &printLog("Bad sequences (file 1): ".&addCommas($badcount)." 
(".sprintf("%.2f",(100*$badcount/$numseqs))."%)") if($numseqs); + &printLog("Bad bases (file 1): ".&addCommas($badbases)) if($badcount); + &printLog("Bad mean length (file 1): ".sprintf("%.2f",$badbases/$badcount)) if($badcount); + &printLog("Bad sequences (file 2): ".&addCommas($badcount2)." (".sprintf("%.2f",(100*$badcount2/$numseqs2))."%)") if($numseqs2); + &printLog("Bad bases (file 2): ".&addCommas($badbases2)) if($badcount2); + &printLog("Bad mean length (file 2): ".sprintf("%.2f",$badbases2/$badcount2)) if($badcount2); + } else { + &printLog("Input sequences: ".&addCommas($numseqs)); + &printLog("Input bases: ".&addCommas($numbases)); + &printLog("Input mean length: ".sprintf("%.2f",$numbases/$numseqs)) if($numseqs); + &printLog("Good sequences: ".&addCommas($seqcount)." (".sprintf("%.2f",(100*$seqcount/$numseqs))."%)") if($numseqs); + &printLog("Good bases: ".&addCommas($seqbases)) if($seqcount); + &printLog("Good mean length: ".sprintf("%.2f",$seqbases/$seqcount)) if($seqcount); + &printLog("Bad sequences: ".&addCommas($badcount)." 
(".sprintf("%.2f",(100*$badcount/$numseqs))."%)") if($numseqs); + &printLog("Bad bases: ".&addCommas($badbases)) if($badcount); + &printLog("Bad mean length: ".sprintf("%.2f",$badbases/$badcount)) if($badcount); + } + if($tmp) { + &printLog("Sequences filtered by specified parameters:"); + if(scalar(@$tmp)) { + &printLog($_) foreach(@$tmp); + } else { + &printLog("none"); + } + + } + } +} + +#print summary stats +if(exists $params{stats}) { + if(exists $params{stats_info}) { + $stats{stats_info}->{reads} = $numseqs; + $stats{stats_info}->{bases} = $numbases; + if($file2) { + $stats{stats_info2}->{reads} = $numseqs2; + $stats{stats_info2}->{bases} = $numbases2; + } + } + if(exists $params{stats_len}) { + $stats{stats_len} = &generateStats($counts{length}); + $stats{stats_len2} = &generateStats($counts2{length}) if($file2); + } + if(exists $params{stats_dinuc}) { + foreach my $i (keys %odds) { + $stats{stats_dinuc}->{lc($i)} = sprintf("%.9f",$odds{$i}/($numseqs+$numseqs2)); + } + } + if(exists $params{stats_tag}) { + #calculate frequency of 5-mers + my $kmersum = &getTagFrequency(\%kmers); + #check for frequency of MID tags + my $midsum = 0; + my $midcount = 0; + my @midseqs; + foreach my $mid (keys %MIDS) { + $midsum += $MIDS{$mid}; + if($MIDS{$mid} > $numseqs/34) { #in more than 34/100 (approx. 
3%) as this is estimated average error for MIDs + $midcount++; + push(@midseqs,$mid); + } + } + $stats{stats_tag}->{midnum} = $midcount; + if($midcount) { + $stats{stats_tag}->{midseq} = join(",",@midseqs); + } + if($midsum > $kmersum->{5}) { + $kmersum->{5} = $midsum; + } + foreach my $kmer (keys %$kmersum) { + $stats{stats_tag}->{'prob'.$kmer} = sprintf("%d",(100/$numseqs*$kmersum->{$kmer})); + } + if($file2) { + my $kmersum2 = &getTagFrequency(\%kmers2); + foreach my $kmer (keys %$kmersum2) { + $stats{stats_tag2}->{'prob'.$kmer} = sprintf("%d",(100/$numseqs2*$kmersum2->{$kmer})); + } + } + } + if(exists $params{stats_assembly}) { + #calculate N50, N90, etc + #sort in decreasing order + my @sortvals = sort {$b <=> $a} keys %{$counts{length}}; + #calculate nx values + my $n50 = $numbases*0.5; + my $n75 = $numbases*0.75; + my $n90 = $numbases*0.9; + my $n95 = $numbases*0.95; + my $curlen = 0; + foreach my $len (@sortvals) { + foreach my $i (1..$counts{length}->{$len}) { + $curlen += $len; + if($curlen >= $n50 && !exists $stats{stats_assembly}->{N50}) { + $stats{stats_assembly}->{N50} = $len; + } elsif($curlen >= $n75 && !exists $stats{stats_assembly}->{N75}) { + $stats{stats_assembly}->{N75} = $len; + } elsif($curlen >= $n90 && !exists $stats{stats_assembly}->{N90}) { + $stats{stats_assembly}->{N90} = $len; + } elsif($curlen >= $n95 && !exists $stats{stats_assembly}->{N95}) { + $stats{stats_assembly}->{N95} = $len; + } + } + } + foreach my $i (50,75,90,95) { + unless(exists $stats{stats_assembly}->{'N'.$i}) { + $stats{stats_assembly}->{'N'.$i} = '-'; + } + } + } + if(exists $params{stats_dupl}) { + #empty vars before n-plicate check + %counts = %kmers = %odds = (); + #0 - exact dub, 1 - prefix, 2 - suffix, 3 - revcomp exact, 4 - revcomp prefix/suffix + my %types = (0 => 'exact', 1 => '5', 2 => '3', 3 => 'exactrevcomp', 4 => 'revcomp'); + my ($dupls,undef,undef) = &checkForDupl(\@seqs,\%types,$numseqs); + #set zero counts + foreach my $s (keys %types) { + 
$stats{stats_dupl}->{$types{$s}} = 0; + $stats{stats_dupl}->{$types{$s}.'maxd'} = 0; + } + foreach my $n (keys %$dupls) { + foreach my $s (keys %{$dupls->{$n}}) { + $stats{stats_dupl}->{$types{$s}} += $dupls->{$n}->{$s} * $n; + $stats{stats_dupl}->{$types{$s}.'maxd'} = $n unless($stats{stats_dupl}->{$types{$s}.'maxd'} > $n); + $stats{stats_dupl}->{total} += $dupls->{$n}->{$s} * $n; + } + } + } + foreach my $type (sort keys %stats) { + foreach my $value (sort keys %{$stats{$type}}) { + print STDOUT join("\t",$type,$value,(defined $stats{$type}->{$value} ? $stats{$type}->{$value} : '-'))."\n"; + } + } +} + +if(exists $params{graph_data}) { + &printLog("Generate graph data"); + print STDERR "Generate graph data\n" if(exists $params{verbose}); + &printWeb("Start generating statistics from data"); + #get qual stats + my $binval = &getBinVal($maxlength); + if($graphdata{quals} || $graphdata{quala}) { + &printWeb("Generating statistics from quality data"); + if($graphdata{quals}) { + $graphdata{quals} = &generateStatsType($graphdata{quals}); + } + if($graphdata{quala}) { + #calculate bin values + my $tmppos; + foreach my $pos (keys %{$graphdata{quala}}) { + $tmppos = int(($pos-1)/$binval); + foreach my $val (keys %{$graphdata{quala}->{$pos}}) { + $graphdata{qualsbin}->{$tmppos}->{$val} += $graphdata{quala}->{$pos}->{$val}; + } + } + $graphdata{qualsbin} = &generateStatsType($graphdata{qualsbin}); + } + } + if($file2 && ($graphdata2{quals} || $graphdata2{quala})) { + $graphdata{qualsmean2} = $graphdata2{qualsmean}; + if($graphdata2{quals}) { + $graphdata{quals2} = &generateStatsType($graphdata2{quals}); + } + if($graphdata2{quala}) { + #calculate bin values + my $tmppos; + foreach my $pos (keys %{$graphdata2{quala}}) { + $tmppos = int(($pos-1)/$binval); + foreach my $val (keys %{$graphdata2{quala}->{$pos}}) { + $graphdata{qualsbin2}->{$tmppos}->{$val} += $graphdata2{quala}->{$pos}->{$val}; + } + } + $graphdata{qualsbin2} = &generateStatsType($graphdata{qualsbin2}); + } + } 
+ #get length stats + if($graphdata{counts}) { + &printWeb("Generating statistics from basic counts"); + $graphdata{stats} = &generateStatsType($graphdata{counts}); + } + if($file2 && $graphdata2{counts}) { + $graphdata{counts2} = $graphdata2{counts}; + $graphdata{stats2} = &generateStatsType($graphdata2{counts}); + } + #check for ns + if(($webstats{ns} || $graphstats{ns}) && scalar(keys %{$graphdata{counts}->{ns}}) == 0) { + $graphdata{counts}->{ns}->{0} = 0; + } + if($file2 && ($webstats{ns} || $graphstats{ns}) && scalar(keys %{$graphdata{counts2}->{ns}}) == 0) { + $graphdata{counts2}->{ns}->{0} = 0; + } + #add base frequencies + foreach my $site (keys %{$graphdata{freqs}}) { + foreach my $i (0..$TAG_LENGTH-1) { + foreach my $base ('A','C','G','T','N') { + if(exists $graphdata{freqs}->{$site}->{$i}->{$base}) { + $graphdata{freqs}->{$site}->{$i}->{$base} = int($graphdata{freqs}->{$site}->{$i}->{$base}*100/$numseqs); + } else { + $graphdata{freqs}->{$site}->{$i}->{$base} = 0; + } + } + } + } + if($file2) { + foreach my $site (keys %{$graphdata2{freqs}}) { + foreach my $i (0..$TAG_LENGTH-1) { + foreach my $base ('A','C','G','T','N') { + if(exists $graphdata2{freqs}->{$site}->{$i}->{$base}) { + $graphdata{freqs2}->{$site}->{$i}->{$base} = int($graphdata2{freqs}->{$site}->{$i}->{$base}*100/$numseqs2); + } else { + $graphdata{freqs2}->{$site}->{$i}->{$base} = 0; + } + } + } + } + } + #calculate probability for tag sequences + if(scalar(keys %{$graphdata{kmers}})) { + &printWeb("Generating statistics for tag sequences"); + my %prob; + #calculate frequency of 5-mers + my $kmersum = &getTagFrequency($graphdata{kmers},$numseqs); + #check for frequency of MID tags + my $midsum = 0; + my $midcount = 0; + my @midseqs; + foreach my $mid (keys %{$graphdata{mids}}) { + $midsum += $graphdata{mids}->{$mid}; + if($graphdata{mids}->{$mid} > $numseqs/34) { #in more than 1/34 of the sequences (approx. 
3%), as this is the estimated average error for MIDs + $midcount++; + push(@midseqs,$mid); + } + } + if($midcount) { + $graphdata{tagmidseq} = join(",",@midseqs); + } + $graphdata{tagmidnum} = $midcount; + if($midsum > $kmersum->{5}) { + $kmersum->{5} = $midsum; + } + foreach my $kmer (keys %$kmersum) { + $prob{$kmer} = sprintf("%d",(100/$numseqs*$kmersum->{$kmer})); + } + $graphdata{tagprob} = \%prob; + delete($graphdata{kmers}); + } + if($file2 && scalar(keys %{$graphdata2{kmers}})) { + my %prob; + #calculate frequency of 5-mers + my $kmersum = &getTagFrequency($graphdata2{kmers},$numseqs2); + foreach my $kmer (keys %$kmersum) { + $prob{$kmer} = sprintf("%d",(100/$numseqs2*$kmersum->{$kmer})); + } + $graphdata{tagprob2} = \%prob; + delete($graphdata2{kmers}); + } + #add dinucleotide odds ratios + if(($webstats{dn} || $graphstats{dn}) && scalar(keys %{$graphdata{dinucodds}})) { + &printWeb("Generating statistics from dinucleotide counts"); + foreach my $i (keys %{$graphdata{dinucodds}}) { + $graphdata{dinucodds}->{$i} = sprintf("%.9f",$graphdata{dinucodds}->{$i}/($numseqs+$numseqs2)); + } + } + #check for n-plicates (for paired-end data, not separated by input file) + if($webstats{de} || $webstats{da} || $graphstats{de} || $graphstats{da}) { + &printWeb("Generating statistics from duplicate counts"); + if($webstats{de} || $graphstats{de}) { + foreach my $m (keys %md5sg) { + if(exists $md5sg{$m}->{0} && $md5sg{$m}->{0} > 0) { + $graphdata{dubscounts}->{$md5sg{$m}->{0}}->{0}++; + } + if(exists $md5sg{$m}->{3} && $md5sg{$m}->{3} > 0) { + $graphdata{dubscounts}->{$md5sg{$m}->{3}}->{3}++; + } + } + } else { + my %types = (0 => 0, 1 => 0, 2 => 0, 3 => 0, 4 => 0); + ($graphdata{dubscounts},$graphdata{dubslength},undef) = &checkForDupl(\@seqs,\%types,$numseqs); #0 - exact dub, 1 - prefix, 2 - suffix, 3 - revcomp exact, 4 - revcomp prefix/suffix + } + } + + #generate JSON string without the need of the JSON module + my $str = ''; + $str .= 
'{"numseqs":'.$numseqs.',"numbases":'.$numbases.',"pairedend":'.($file2 ? 1 : 0).($file2 ? ',"numseqs2":'.$numseqs2.',"numbases2":'.$numbases2.',"pairs":'.$pairs : '').',"maxlength":'.$maxlength.',"binval":'.$binval.',"exactonly":'.$exactonly.',"tagmidnum":'.($graphdata{tagmidnum}||0).',"scale":'.$scale.',"filename1":"'.(exists $params{filename1} ? $params{filename1} : &convertStringToInt(&getFileName($file1))).'","format1":"'.(exists $params{fasta} ? 'fasta' : 'fastq').'"'.(exists $params{qual} ? ',"filename2":"'.(exists $params{filename2} ? $params{filename2} : &convertStringToInt(&getFileName($params{qual}))).'","format2":"qual"' : '').($file2 ? ',"filename2":"'.(exists $params{filename2} ? $params{filename2} : &convertStringToInt(&getFileName($file2))).'"' : ''); + foreach my $s (qw(counts counts2 stats stats2 quals quals2 qualsbin qualsbin2 complvals dubscounts dubslength)) { + next unless exists($graphdata{$s}); + $str .= ',"'.$s.'":{'; + foreach my $t (sort keys %{$graphdata{$s}}) { + $str .= '"'.$t.'":{'; + while (my ($k,$v) = each(%{$graphdata{$s}->{$t}})) { + if($v =~ /^\d+$/) { + $str .= '"'.$k.'":'.$v.','; + } else { + $str .= '"'.$k.'":"'.$v.'",'; + } + } + $str =~ s/\,$//; + $str .= '},'; + } + $str =~ s/\,$//; + $str .= '}'; + } + foreach my $s (qw(qualsmean qualsmean2 tagprob tagprob2 compldust complentropy dinucodds)) { + next unless exists($graphdata{$s}); + $str .= ',"'.$s.'":{'; + while (my ($k,$v) = each(%{$graphdata{$s}})) { + if($v =~ /^\d+$/) { + $str .= '"'.$k.'":'.$v.','; + } else { + $str .= '"'.$k.'":"'.$v.'",'; + } + } + $str =~ s/\,$//; + $str .= '}'; + } + $str .= ',"tail":'.(exists $graphdata{counts}->{tail5} || exists $graphdata{counts}->{tail3} ? 1 : 0); + $str .= ',"tail2":'.(exists $graphdata2{counts}->{tail5} || exists $graphdata2{counts}->{tail3} ? 
1 : 0) if($file2); + if($webstats{ts} || $graphstats{ts}) { + $str .= ',"freqs":{'; + foreach my $i (keys %{$graphdata{freqs}}) { # 5, 3 + $str .= '"'.$i.'":{'; + foreach my $pos (keys %{$graphdata{freqs}->{$i}}) { + $str .= '"'.$pos.'":{'; + while (my ($base,$v) = each(%{$graphdata{freqs}->{$i}->{$pos}})) { + $str .= '"'.$base.'":'.$v.','; + } + $str =~ s/\,$//; + $str .= '},'; + } + $str =~ s/\,$//; + $str .= '},'; + } + $str =~ s/\,$//; + $str .= '}'; + if($file2) { + $str .= ',"freqs2":{'; + foreach my $i (keys %{$graphdata{freqs2}}) { # 5, 3 + $str .= '"'.$i.'":{'; + foreach my $pos (keys %{$graphdata{freqs2}->{$i}}) { + $str .= '"'.$pos.'":{'; + while (my ($base,$v) = each(%{$graphdata{freqs2}->{$i}->{$pos}})) { + $str .= '"'.$base.'":'.$v.','; + } + $str =~ s/\,$//; + $str .= '},'; + } + $str =~ s/\,$//; + $str .= '},'; + } + $str =~ s/\,$//; + $str .= '}'; + } + } + foreach my $s (qw(tagmidseq)) { + next unless exists($graphdata{$s}); + $str .= ',"'.$s.'":"'.$graphdata{$s}.'"'; + } + + $str =~ s/\,$//; + $str .= '}'; + + #write data to file + my $time = sprintf("%02d/%02d/%04d %02d:%02d:%02d",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); + open(FH, ">", $params{graph_data}) or &printError("Cannot open file ".$params{graph_data}.": $!"); + flock(FH, LOCK_EX) or &printError("Cannot lock file ".$params{graph_data}.": $!"); + print FH "#Graph data\n"; + print FH "#[prinseq-".$WHAT."-$VERSION] [$time] Command: \"perl prinseq-".$WHAT.".pl".$command."\"\n" unless(exists $params{web}); + print FH $str; + flock(FH, LOCK_UN) or &printError("Cannot unlock ".$params{graph_data}.": $!"); + close(FH); + &printLog("Done with graph data"); + print STDERR "\tdone\n" if(exists $params{verbose}); +} + +&printWeb("STATUS: done"); + +## +################################################################################# +### MISC FUNCTIONS +################################################################################# +## + +sub printError { + my $msg = 
shift; + print STDERR "\nERROR: ".$msg.".\n\nTry \'perl prinseq-".$WHAT.".pl -h\' for more information.\nExit program.\n"; + &printLog("ERROR: ".$msg.". Exit program.\n"); + exit(0); +} + +sub printWarning { + my $msg = shift; + print STDERR "WARNING: ".$msg.".\n"; + &printLog("WARNING: ".$msg.".\n"); +} + +sub printWeb { + my $msg = shift; + if(exists $params{web}) { + print STDERR &getTime()."$msg\n"; + } +} + +sub getTime { + return sprintf("[%02d/%02d/%04d %02d:%02d:%02d] ",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); +} + +sub printLog { + my $msg = shift; + if(exists $params{log}) { + my $time = sprintf("%02d/%02d/%04d %02d:%02d:%02d",sub {($_[4]+1,$_[3],$_[5]+1900,$_[2],$_[1],$_[0])}->(localtime)); + open(FH, ">>", $params{log}) or &printError("Cannot open file ".$params{log}.": $!"); + flock(FH, LOCK_EX) or &printError("Cannot lock file ".$params{log}.": $!"); + print FH "[prinseq-".$WHAT."-$VERSION] [$time] $msg\n"; + flock(FH, LOCK_UN) or &printError("Cannot unlock ".$params{log}.": $!"); + close(FH); + } +} + +sub addCommas { + my $num = shift; + return unless(defined $num); + return $num if($num < 1000); + $num = scalar reverse $num; + $num =~ s/(\d{3})/$1\,/g; + $num =~ s/\,$//; + $num = scalar reverse $num; + return $num; +} + +sub getLineNumber { + my $file = shift; + my $lines = 0; + open(FILE,"perl -pe 's/\r\n|\r/\n/g' < $file |") or &printError("Could not open file $file: $!"); + $lines += tr/\n/\n/ while sysread(FILE, $_, 2 ** 16); + close(FILE); + return $lines; +} + +sub readParamsFile { + my $file = shift; + my @args; + my %parameters = (); + open(FILE,"perl -pe 's/\r\n|\r/\n/g' < $file |") or &printError("Could not open file $file: $!"); + while(<FILE>) { + next if(/^\#/); + chomp(); + @args = split(/\s+/); + if(@args) { + $args[0] =~ s/^\-//; + $parameters{$args[0]} = (defined $args[1] ? 
join(" ",@args[1..scalar(@args)-1]) : ''); + } + } + close(FILE); + return \%parameters; +} + +sub checkFileFormat { + my $file = shift; + + my ($format,$count,$id,$fasta,$fastq,$qual); + $count = 3; + $fasta = $fastq = $qual = 0; + $format = 'unknown'; + + open(FILE,"perl -pe 's/\r\n|\r/\n/g' < $file |") or &printError("Could not open file $file: $!"); + while (<FILE>) { +# chomp(); + # next unless(length($_)); + if($count-- == 0) { + last; + } elsif(!$fasta && /^\>\S+\s*/o) { + $fasta = 1; + $qual = 1; + } elsif($fasta == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmmopqrstuvwyzx*-]+/o) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/o))) { + $fasta = 2; + } elsif($qual == 1 && /^\s*\d+/o) { + $qual = 2; + } elsif(!$fastq && /^\@(\S+)\s*/o) { + $id = $1; + $fastq = 1; + } elsif($fastq == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmmopqrstuvwyzx*-]+/o) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/o))) { + $fastq = 2; + } elsif($fastq == 2 && /^\+(\S*)\s*/o) { + $fastq = 3 if($id eq $1 || /^\+\s*$/o); + } + } + close(FILE); + if($fasta == 2) { + $format = 'fasta'; + } elsif($qual == 2) { + $format = 'qual'; + } elsif($fastq == 3) { + $format = 'fastq'; + } + + return $format; +} + +sub checkSlashnum { + my $file = shift; + open(FILE,"perl -pe 's/\r\n|\r/\n/g' < $file |") or &printError("Could not open file $file: $!"); + while(<FILE>) { + chomp(); + next unless(length($_)); + if(/^\S+\/[12]\s*/ || /^\S+\_[LR]\s*/) { + close(FILE); + return (1,2,2); + } elsif(/^\S+\_left\s*/) { + close(FILE); + return (1,6,5); + } elsif(/^\S+\_right\s*/) { + close(FILE); + return (1,5,6); + } else { + close(FILE); + return 0; + } + } + return 0; +} + +sub checkInputFormat { + my ($format,$count,$id,$fasta,$fastq,$qual); + $count = 3; + $fasta = $fastq = $qual = 0; + $format = 'unknown'; + + while (<STDIN>) { + push(@dataread,$_); +# chomp(); + # next unless(length($_)); + if($count-- == 0) { + last; + } elsif(!$fasta && /^\>\S+\s*/o) { + $fasta = 1; + $qual = 1; + } 
elsif($fasta == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmmopqrstuvwyzx*-]+/o) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/o))) { + $fasta = 2; + } elsif($qual == 1 && /^\s*\d+/) { + $qual = 2; + } elsif(!$fastq && /^\@(\S+)\s*/) { + $id = $1; + $fastq = 1; + } elsif($fastq == 1 && (($aa && /^[ABCDEFGHIKLMNOPQRSTUVWYZXabcdefghiklmmopqrstuvwyzx*-]+/o) || (!$aa && /^[ACGTURYKMSWBDHVNXacgturykmswbdhvnx-]+/o))) { + $fastq = 2; + } elsif($fastq == 2 && /^\+(\S*)\s*/o) { + $fastq = 3 if($id eq $1 || /^\+\s*$/o); + } + } + + if($fasta == 2) { + $format = 'fasta'; + } elsif($qual == 2) { + $format = 'qual'; + } elsif($fastq == 3) { + $format = 'fastq'; + } + + return $format; +} + +sub getArrayMean { + return @_ ? sum(@_) / @_ : 0; +} + +sub convertQualNumsToAscii { + my $qual = shift; + my @ascii; + $qual =~ s/^\s+//; + $qual =~ s/\s+$//; + my @nums = split(/\s+/,$qual); + foreach(@nums) { + push(@ascii,chr(($_ <= 93 ? $_ : 93) + 33)); + } + return \@ascii; +} + +sub convertQualNumsToAsciiString { + my $qual = shift; + my $ascii; + $qual =~ s/^\s+//; + $qual =~ s/\s+$//; + my @nums = split(/\s+/,$qual); + foreach(@nums) { + $ascii .= chr(($_ <= 93 ? $_ : 93) + 33); + } + return $ascii; +} + +sub convertQualAsciiToNums { + my $qual = shift; + my @nums; + my @ascii = split(//,$qual); + foreach(@ascii) { + push(@nums,(ord($_) - 33)); + } + return \@nums; +} + +sub convertQualAsciiToNumsPhred64 { + my $qual = shift; + my @nums; + my $tmp; + my $err = 0; + my @ascii = split('',$qual); + foreach(@ascii) { + $tmp = (ord($_) - 64); + if($tmp < 0) { + $err = 1; + last; + } + push(@nums,$tmp); + } + return (\@nums,$err); +} + +sub convertQualArrayToString { + my ($nums,$linelen) = @_; + $linelen = 80 unless($linelen); + my $str; + my $count = 1; + foreach my $n (@$nums) { + $str .= ($n < 10 ? ' '.$n : $n).' 
'; + if(++$count > $linelen) { + $count = 1; + $str =~ s/\s$//; + $str .= "\n"; + } + } + $str =~ s/[\s\n]$//; + return $str; +} + +sub checkRange { + my ($range,$val) = @_; + my @ranges = split(/\,/,$range); + foreach my $r (@ranges) { + my @tmp = split(/\-/,$r); + return 0 if($val < $tmp[0] || $val > $tmp[1]); + } + return 1; +} + +sub getFiltercounts { + my @order = qw(seq_num trim_left trim_right trim_left_p trim_right_p trim_qual_left trim_qual_right trim_tail_left trim_tail_right trim_ns_left trim_ns_right zero_length min_len max_len range_len min_qual_score max_qual_score min_qual_mean max_qual_mean min_gc max_gc range_gc ns_max_p ns_max_n noniupac custom_params lc_method derep); + my @counts; + foreach my $p (@order) { + if(exists $filtercount{$p}) { + push(@counts,$p.': '.$filtercount{$p}); + } + } + return \@counts; +} + +sub readEntryFastq { + my ($fh,$skip,$trimnum) = @_; + if($skip > 0) { + return 1; + } + my ($seq,$seqid,$qual,$header,$line,$tmpid); + if(defined($line = <$fh>)) { + $line =~ /^\@(\S+)\s*(.*)$/o; + $seqid = $1; + $header = $2 || ''; + $seq = readline($fh); + chomp($seq); + readline($fh); + $qual = readline($fh); + chomp($qual); + #progress bar stuff + $counter += 4; + } + if($slashnum) { + $tmpid = substr($seqid, 0, -$trimnum) + } else { + $tmpid = $seqid; + } + return ($seq,$seqid,$qual,$header,$tmpid); +} + +sub readEntryFasta { + my ($fh,$skip,$trimnum,$nextid,$nextheader) = @_; + if($skip > 0) { + return (1,undef,undef,undef,$nextid,$nextheader); + } + my ($seq,$seqid,$header,$line,$tmpid); + $seq = ''; + if($nextid) { + $seqid = $nextid; + $header = $nextheader; + } + while(<$fh>) { + chomp(); + if(/^>(\S+)\s*(.*)$/o) { + if($seq) { + $nextid = $1; + $nextheader = $2 || ''; + last; + } + $seqid = $1; + $header = $2 || ''; + } else { + $seq .= $_; + } + } + if($slashnum) { + $tmpid = substr($seqid, 0, -$trimnum) + } else { + $tmpid = $seqid; + } + return ($seq,$seqid,$header,$tmpid,$nextid,$nextheader); +} + +sub processEntry { + my 
($length,$seq,$seqid,$qual,$header) = @_; + #check that sequence and quality are same length + if(defined $qual && length($qual) && $length != length($qual)) { + &printError("The number of bases and quality scores are not the same for sequence \"$seqid\""); + } + #remove anything non-alphabetic from sequences + $seq =~ tr/a-zA-Z//cd; + $numseqs++; + $numbases += $length; + #process entry + if($exists_stats) { #calc summary stats + $seq = uc($seq); + &calcSeqStats($seq,$length,\%stats,\%kmers,\%odds,\%counts); + if(exists $params{stats_dupl}) { + push(@seqs,[$seq,$numseqs,$length]); + } + } else { #process data + &processData($seqid,$seq,$qual,$header) unless($webnoprocess); + #get graph data + if($exists_graphdata) { + $ucseq = uc($seq); + &getSeqStats(\%graphdata,$ucseq,$length); + if($qual) { + &getQualStats(\%graphdata,$qual,$length); + } + if($webstats{de} || $graphstats{de}) { + $md5 = md5_hex($ucseq); + if(exists $md5sg{$md5}) { #forward duplicate + $graphdata{dubslength}->{$length}->{0}++; + $md5sg{$md5}->{0}++; + } else { + $md5r = md5_hex(&revcompuc($ucseq)); + if(exists $md5sg{$md5r}) { #reverse duplicate + $graphdata{dubslength}->{$length}->{3}++; + $md5sg{$md5}->{3}++; + } + unless(exists $md5sg{$md5}) { + $md5sg{$md5} = {0 => 0, 3 => 0}; + } + } + } elsif($webstats{da} || $graphstats{da}) { + push(@seqs,[$ucseq,$numseqs,$length]); + } + } + } +} + +sub processEntryPairedEnd { + my ($length,$seq,$seqid,$header,$qual,$length2,$seq2,$seqid2,$header2,$qual2) = @_; + my ($tmpseq1,$tmpseq2,$tmpgood1,$tmpgood2,$tmpbegin1,$tmpend1,$tmpbegin2,$tmpend2); + $tmpgood1 = $tmpgood2 = 0; + if($length) { + #check that sequence and quality are same length + if(defined $qual && $length != length($qual)) { + &printError("The number of bases and quality scores are not the same for sequence \"$seqid\""); + } + #remove anything non-alphabetic from sequences + $seq =~ tr/a-zA-Z//cd; + $numseqs++; + $numbases += $length; + #process entry + if($exists_stats) { #calc summary 
stats + $seq = uc($seq); + &calcSeqStats($seq,$length,\%stats,\%kmers,\%odds,\%counts); + } else { #process data + ($tmpseq1,$tmpgood1,$tmpbegin1,$tmpend1) = &processData($seqid,$seq,$qual,$header) unless($webnoprocess); + #get graph data + if($exists_graphdata) { + $ucseq = uc($seq); + &getSeqStats(\%graphdata,$ucseq,$length); + if($qual) { + &getQualStats(\%graphdata,$qual,$length); + } + if($length2) { + $pairs++; + } + if($webstats{de} || $graphstats{de}) { + if($length2) { #both + $md5 = md5_hex($ucseq.'0'.uc($seq2)); + if(exists $md5sg{$md5}) { #forward duplicate + $md5sg{$md5}->{0}++; + $graphdata{dubslength}->{($length+$length2)}->{0}++; + } else { + $md5sg{$md5}->{0} = 0; + } + $md5 = md5_hex($ucseq); + unless(exists $md5sg{$md5}) { + $md5sg{$md5}->{0} = 0; + } + } else { #only seq, no seq2 + $md5 = md5_hex($ucseq); + if(exists $md5sg{$md5}) { #forward duplicate + $md5sg{$md5}->{0}++; + $graphdata{dubslength}->{$length}->{0}++; + } else { + $md5sg{$md5}->{0} = 0; + } + } + } + } + } + } + if($length2) { + if(defined $qual2 && $length2 != length($qual2)) { + &printError("The number of bases and quality scores are not the same for sequence \"$seqid2\""); + } + #remove anything non-alphabetic from sequences + $seq2 =~ tr/a-zA-Z//cd; + $numseqs2++; + $numbases2 += $length2; + #process entry + if($exists_stats) { #calc summary stats + $seq2 = uc($seq2); + &calcSeqStats($seq2,$length2,\%stats,\%kmers2,\%odds,\%counts2,1); #last one used to calculate stats for second input file + } else { #process data + ($tmpseq2,$tmpgood2,$tmpbegin2,$tmpend2) = &processData($seqid2,$seq2,$qual2,$header2) unless($webnoprocess); + #get graph data + if($exists_graphdata) { + $ucseq = uc($seq2); + &getSeqStats(\%graphdata2,$ucseq,$length2); + if($qual2) { + &getQualStats(\%graphdata2,$qual2,$length2); + } + if($webstats{de} || $graphstats{de}) { + #only seq2, no seq + $md5 = md5_hex($ucseq); + unless($length) { + if(exists $md5sg{$md5}) { #forward duplicate + $md5sg{$md5}->{0}++; + 
$graphdata{dubslength}->{$length2}->{0}++; + } else { + $md5sg{$md5}->{0} = 0; + } + } + } + } + } + } + + return if($exists_stats || $webnoprocess); + + #check for duplicates + if($derep) { + if($tmpgood1 && $tmpgood2) { + $md5 = md5_hex($tmpseq1.'0'.$tmpseq2); + if(exists $md5s{$md5}) { #forward duplicate + $md5s{$md5}++; + if($derepmin <= $md5s{$md5}+1) { + $tmpgood1 = $tmpgood2 = 0; + $filtercount{derep}++; + } + } else { + $md5s{$md5} = 0; + } + $md5 = md5_hex($tmpseq1); + unless(exists $md5s{$md5}) { + $md5s{$md5} = 0; + } + $md5 = md5_hex($tmpseq2); + unless(exists $md5s{$md5}) { + $md5s{$md5} = 0; + } + } elsif($tmpgood1) { + $md5 = md5_hex($tmpseq1); + if(exists $md5s{$md5}) { #forward duplicate + $md5s{$md5}++; + if($derepmin <= $md5s{$md5}+1) { + $tmpgood1 = 0; + $filtercount{derep}++; + } + } else { + $md5s{$md5} = 0; + } + } elsif($tmpgood2) { + $md5 = md5_hex($tmpseq2); + if(exists $md5s{$md5}) { #forward duplicate + $md5s{$md5}++; + if($derepmin <= $md5s{$md5}+1) { + $tmpgood2 = 0; + $filtercount{derep}++; + } + } else { + $md5s{$md5} = 0; + } + } + } + + #write to outputs files + if($tmpgood1) { + if(exists $params{seq_id}) { + if($mappings) { + print $fhmappings join("\t",$seqid,$params{seq_id}.$seqcount)."\n"; + } + $seqid = $params{seq_id}.$seqcount; + } + if(exists $params{rm_header}) { + $header = undef; + } + #trim if necessary + if($tmpbegin1) { + $seq = substr($seq,$tmpbegin1); + $qual = substr($qual,$tmpbegin1) if(defined $qual && length($qual)); + } + if($tmpend1) { + $length = length($seq); + $seq = substr($seq,0,$length-$tmpend1); + $qual = substr($qual,0,$length-$tmpend1) if(defined $qual && length($qual)); + } + #change case + if(exists $params{seq_case}) { + if($params{seq_case} eq 'lower') { #lower case + $seq = lc($seq); + } elsif($params{seq_case} eq 'upper') { #upper case + $seq = uc($seq); + } + } + #convert between DNA and RNA + if(exists $params{dna_rna}) { + if($params{dna_rna} eq 'dna') { #RNA to DNA + $seq =~ tr/Uu/Tt/; + } 
elsif($params{dna_rna} eq 'rna') { #DNA to RNA + $seq =~ tr/Tt/Uu/; + } + } + } + if($tmpgood2) { + if(exists $params{seq_id}) { + if($mappings) { + print $fhmappings join("\t",$seqid2,$params{seq_id}.$seqcount)."\n"; + } + $seqid2 = $params{seq_id}.$seqcount; + } + if(exists $params{rm_header}) { + $header2 = undef; + } + #trim if necessary + if($tmpbegin2) { + $seq2 = substr($seq2,$tmpbegin2); + $qual2 = substr($qual2,$tmpbegin2) if(defined $qual2 && length($qual2)); + } + if($tmpend2) { + $length2 = length($seq2); + $seq2 = substr($seq2,0,$length2-$tmpend2); + $qual2 = substr($qual2,0,$length2-$tmpend2) if(defined $qual2 && length($qual2)); + } + #change case + if(exists $params{seq_case}) { + if($params{seq_case} eq 'lower') { #lower case + $seq2 = lc($seq2); + } elsif($params{seq_case} eq 'upper') { #upper case + $seq2 = uc($seq2); + } + } + #convert between DNA and RNA + if(exists $params{dna_rna}) { + if($params{dna_rna} eq 'dna') { #RNA to DNA + $seq2 =~ tr/Uu/Tt/; + } elsif($params{dna_rna} eq 'rna') { #DNA to RNA + $seq2 =~ tr/Tt/Uu/; + } + } + } + if($tmpgood1 && $tmpgood2) { #pair + $seqcount++; + $seqbases += $length+$length2; + return if($nogood); + if($params{out_format} == 3) { # FASTQ + &printError("missing quality data for sequence \"$seqid\" or greater number of sequences than available quality scores") unless(defined $qual); + &printError("missing quality data for sequence \"$seqid2\" or greater number of sequences than available quality scores") unless(defined $qual2); + if($stdoutgood) { + print STDOUT '@'.$seqid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $seqid.($header ? ' '.$header : ''))."\n"; + print STDOUT $qual."\n"; + print STDOUT '@'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print STDOUT $seq2."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $seqid2.($header2 ? 
' '.$header2 : ''))."\n"; + print STDOUT $qual2."\n"; + } else { + print $fhgood '@'.$seqid.($header ? ' '.$header : '')."\n"; + print $fhgood $seq."\n"; + print $fhgood '+'.(exists $params{no_qual_header} ? '' : $seqid.($header ? ' '.$header : ''))."\n"; + print $fhgood $qual."\n"; + print $fh2good '@'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print $fh2good $seq2."\n"; + print $fh2good '+'.(exists $params{no_qual_header} ? '' : $seqid2.($header2 ? ' '.$header2 : ''))."\n"; + print $fh2good $qual2."\n"; + } + } else { #FASTA + #set line length + if($linelen) { + $seq =~ s/(.{$linelen})/$1\n/g; + $seq =~ s/\n$//; + $seq2 =~ s/(.{$linelen})/$1\n/g; + $seq2 =~ s/\n$//; + } + if($stdoutgood) { + print STDOUT '>'.$seqid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '>'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print STDOUT $seq2."\n"; + } else { + print $fhgood '>'.$seqid.($header ? ' '.$header : '')."\n"; + print $fhgood $seq."\n"; + print $fh2good '>'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print $fh2good $seq2."\n"; + } + } + } elsif($tmpgood1) { #singleton + $seqcount1++; + $seqbases1 += $length; + return if($nogood); + if($params{out_format} == 3) { # FASTQ + &printError("missing quality data for sequence \"$seqid\" or greater number of sequences than available quality scores") unless(defined $qual); + if($stdoutgood) { + print STDOUT '@'.$seqid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $seqid.($header ? ' '.$header : ''))."\n"; + print STDOUT $qual."\n"; + } else { + print $fhgood2 '@'.$seqid.($header ? ' '.$header : '')."\n"; + print $fhgood2 $seq."\n"; + print $fhgood2 '+'.(exists $params{no_qual_header} ? '' : $seqid.($header ? 
' '.$header : ''))."\n"; + print $fhgood2 $qual."\n"; + } + } else { #FASTA + #set line length + if($linelen) { + $seq =~ s/(.{$linelen})/$1\n/g; + $seq =~ s/\n$//; + } + if($stdoutgood) { + print STDOUT '>'.$seqid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + } else { + print $fhgood2 '>'.$seqid.($header ? ' '.$header : '')."\n"; + print $fhgood2 $seq."\n"; + } + } + if($length2) { + $badcount2++; + $badbases2 += length($seq2); + return if($nobad); + #write data + if($params{out_format} == 3) { # FASTQ + &printError("missing quality data for sequence \"$seqid2\" or greater number of sequences than available quality scores") unless(defined $qual2); + if($stdoutbad) { + print STDOUT '@'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print STDOUT $seq2."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $seqid2.($header2 ? ' '.$header2 : ''))."\n"; + print STDOUT $qual2."\n"; + } else { + print $fh2bad '@'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print $fh2bad $seq2."\n"; + print $fh2bad '+'.(exists $params{no_qual_header} ? '' : $seqid2.($header2 ? ' '.$header2 : ''))."\n"; + print $fh2bad $qual2."\n"; + } + } else { #FASTA + #set line length + if($linelen) { + $seq2 =~ s/(.{$linelen})/$1\n/g; + $seq2 =~ s/\n$//; + } + if($stdoutbad) { + print STDOUT '>'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print STDOUT $seq2."\n"; + } else { + print $fh2bad '>'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print $fh2bad $seq2."\n"; + } + } + } + } elsif($tmpgood2) { #singleton + $seqcount2++; + $seqbases2 += $length2; + return if($nogood); + if($params{out_format} == 3) { # FASTQ + &printError("missing quality data for sequence \"$seqid2\" or greater number of sequences than available quality scores") unless(defined $qual2); + if($stdoutgood) { + print STDOUT '@'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print STDOUT $seq2."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $seqid2.($header2 ? 
' '.$header2 : ''))."\n"; + print STDOUT $qual2."\n"; + } else { + print $fh2good2 '@'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print $fh2good2 $seq2."\n"; + print $fh2good2 '+'.(exists $params{no_qual_header} ? '' : $seqid2.($header2 ? ' '.$header2 : ''))."\n"; + print $fh2good2 $qual2."\n"; + } + } else { #FASTA + #set line length + if($linelen) { + $seq2 =~ s/(.{$linelen})/$1\n/g; + $seq2 =~ s/\n$//; + } + if($stdoutgood) { + print STDOUT '>'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print STDOUT $seq2."\n"; + } else { + print $fh2good2 '>'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print $fh2good2 $seq2."\n"; + } + } + if($length) { + $badcount++; + $badbases += length($seq); + return if($nobad); + #write data + if($params{out_format} == 3) { # FASTQ + &printError("missing quality data for sequence \"$seqid\" or greater number of sequences than available quality scores") unless(defined $qual); + if($stdoutbad) { + print STDOUT '@'.$seqid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $seqid.($header ? ' '.$header : ''))."\n"; + print STDOUT $qual."\n"; + } else { + print $fhbad '@'.$seqid.($header ? ' '.$header : '')."\n"; + print $fhbad $seq."\n"; + print $fhbad '+'.(exists $params{no_qual_header} ? '' : $seqid.($header ? ' '.$header : ''))."\n"; + print $fhbad $qual."\n"; + } + } else { #FASTA + #set line length + if($linelen) { + $seq =~ s/(.{$linelen})/$1\n/g; + $seq =~ s/\n$//; + } + if($stdoutbad) { + print STDOUT '>'.$seqid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + } else { + print $fhbad '>'.$seqid.($header ? 
' '.$header : '')."\n"; + print $fhbad $seq."\n"; + } + } + } + } else { + if($length) { + $badcount++; + $badbases += length($seq); + return if($nobad); + #write data + if($params{out_format} == 3) { # FASTQ + &printError("missing quality data for sequence \"$seqid\" or greater number of sequences than available quality scores") unless(defined $qual); + if($stdoutbad) { + print STDOUT '@'.$seqid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $seqid.($header ? ' '.$header : ''))."\n"; + print STDOUT $qual."\n"; + } else { + print $fhbad '@'.$seqid.($header ? ' '.$header : '')."\n"; + print $fhbad $seq."\n"; + print $fhbad '+'.(exists $params{no_qual_header} ? '' : $seqid.($header ? ' '.$header : ''))."\n"; + print $fhbad $qual."\n"; + } + } else { #FASTA + #set line length + if($linelen) { + $seq =~ s/(.{$linelen})/$1\n/g; + $seq =~ s/\n$//; + } + if($stdoutbad) { + print STDOUT '>'.$seqid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + } else { + print $fhbad '>'.$seqid.($header ? ' '.$header : '')."\n"; + print $fhbad $seq."\n"; + } + } + } + if($length2) { + $badcount2++; + $badbases2 += length($seq2); + return if($nobad); + #write data + if($params{out_format} == 3) { # FASTQ + &printError("missing quality data for sequence \"$seqid2\" or greater number of sequences than available quality scores") unless(defined $qual2); + if($stdoutbad) { + print STDOUT '@'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print STDOUT $seq2."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $seqid2.($header2 ? ' '.$header2 : ''))."\n"; + print STDOUT $qual2."\n"; + } else { + print $fh2bad '@'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print $fh2bad $seq2."\n"; + print $fh2bad '+'.(exists $params{no_qual_header} ? '' : $seqid2.($header2 ? 
' '.$header2 : ''))."\n"; + print $fh2bad $qual2."\n"; + } + } else { #FASTA + #set line length + if($linelen) { + $seq2 =~ s/(.{$linelen})/$1\n/g; + $seq2 =~ s/\n$//; + } + if($stdoutbad) { + print STDOUT '>'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print STDOUT $seq2."\n"; + } else { + print $fh2bad '>'.$seqid2.($header2 ? ' '.$header2 : '')."\n"; + print $fh2bad $seq2."\n"; + } + } + } + } +} + +#process sequence (and qual) data +sub processData { + my ($sid,$seq,$qual,$header) = @_; + #assume sequence is good ;-) + my $good = 1; + my $seqn = uc($seq); + my $qualn = $qual; + my $begin = 0; + my $end = 0; + my ($length,$bylength,$qualsnums); + + #check for maximum number sequences requested + if(exists $params{seq_num} && $params{seq_num} <= $seqcount) { + $good = 0; + $filtercount{seq_num}++; + } + + #trim sequence ends + if($good && exists $params{trim_left}) { + $begin += $params{trim_left}; + if($begin >= length($seqn)) { + $good = 0; + $filtercount{trim_left}++; + } else { + $seqn = substr($seqn,$begin); + $qualn = substr($qualn,$begin) if(defined $qualn && length($qualn)); + } + } + if($good && exists $params{trim_right}) { + $end += $params{trim_right}; + $length = length($seqn); + if($end >= $length) { + $good = 0; + $filtercount{trim_right}++; + } else { + $seqn = substr($seqn,0,$length-$end); + $qualn = substr($qualn,0,$length-$end) if(defined $qualn && length($qualn)); + } + } + if($good && exists $params{trim_left_p}) { + $length = length($seqn); + my $begintmp = int($params{trim_left_p}/100*$length); + if($begintmp >= $length) { + $good = 0; + $filtercount{trim_left_p}++; + } else { + $seqn = substr($seqn,$begintmp); + $qualn = substr($qualn,$begintmp) if(defined $qualn && length($qualn)); + $begin += $begintmp; + } + } + if($good && exists $params{trim_right_p}) { + $length = length($seqn); + my $endtmp = int($params{trim_right_p}/100*$length); + if($endtmp >= $length) { + $good = 0; + $filtercount{trim_right_p}++; + } else { + $seqn = 
substr($seqn,0,$length-$endtmp); + $qualn = substr($qualn,0,$length-$endtmp) if(defined $qualn && length($qualn)); + $end += $endtmp; + } + } + + #check for quality scores + if($good && defined $qualn && $trimscore) { + my ($err); + if(exists $params{phred64}) { #scale data to Phred scale if necessary + ($qualsnums,$err) = &convertQualAsciiToNumsPhred64($qualn); + if($err) { + &printError("The sequence quality scores are not in Phred+64 format"); + } + } else { + $qualsnums = &convertQualAsciiToNums($qualn); + } + $length = length($qualn); + my $i = 0; + my $begintmp = 0; + my $endtmp = 0; + my ($window,$val); + #left + if(exists $params{trim_qual_left}) { + while($i < $length) { + #calculate maximum window + $window = ($i+$params{trim_qual_window} <= $length ? $params{trim_qual_window} : ($length-$i)); + #calculate value used to compare with given value + if($window == 1) { + $val = $qualsnums->[$i]; + } elsif($params{trim_qual_type} eq 'min') { + $val = min(@$qualsnums[$i..($i+$window-1)]); + } elsif($params{trim_qual_type} eq 'max') { + $val = max(@$qualsnums[$i..($i+$window-1)]); + } elsif($params{trim_qual_type} eq 'mean') { + $val = &getArrayMean(@$qualsnums[$i..($i+$window-1)]); + } elsif($params{trim_qual_type} eq 'sum') { + last if($window < $params{trim_qual_window}); + $val = sum(@$qualsnums[$i..($i+$window-1)]); + } else { + last; + } + #compare values + if(($params{trim_qual_rule} eq 'lt' && $val < $params{trim_qual_left}) || ($params{trim_qual_rule} eq 'gt' && $val > $params{trim_qual_left}) || ($params{trim_qual_rule} eq 'et' && $val == $params{trim_qual_left})) { + $begintmp += $params{trim_qual_step}; + $i += $params{trim_qual_step}; + } else { + last; + } + } + if($begintmp >= $length) { + $good = 0; + $filtercount{trim_qual_left}++; + } elsif($begintmp > 0) { + $seqn = substr($seqn,$begintmp); + $qualn = substr($qualn,$begintmp); + $begin += $begintmp; + } + } + #right + if($good && exists $params{trim_qual_right}) { + $length -= $begintmp; + my 
@quals = reverse(@$qualsnums); + $i = 0; + while($i < $length) { + #calculate maximum window + $window = ($i+$params{trim_qual_window} <= $length ? $params{trim_qual_window} : ($length-$i)); + #calculate value used to compare with given value + if($window == 1) { + $val = $quals[$i]; + } elsif($params{trim_qual_type} eq 'min') { + $val = min(@quals[$i..($i+$window-1)]); + } elsif($params{trim_qual_type} eq 'max') { + $val = max(@quals[$i..($i+$window-1)]); + } elsif($params{trim_qual_type} eq 'mean') { + $val = &getArrayMean(@quals[$i..($i+$window-1)]); + } elsif($params{trim_qual_type} eq 'sum') { + last if($window < $params{trim_qual_window}); + $val = sum(@quals[$i..($i+$window-1)]); + } else { + last; + } + #compare values + if(($params{trim_qual_rule} eq 'lt' && $val < $params{trim_qual_right}) || ($params{trim_qual_rule} eq 'gt' && $val > $params{trim_qual_right}) || ($params{trim_qual_rule} eq 'et' && $val == $params{trim_qual_right})) { + $endtmp += $params{trim_qual_step}; + $i += $params{trim_qual_step}; + } else { + last; + } + } + if($endtmp >= $length) { + $good = 0; + $filtercount{trim_qual_right}++; + } elsif($endtmp > 0) { + $seqn = substr($seqn,0,$length-$endtmp); + $qualn = substr($qualn,0,$length-$endtmp); + $end += $endtmp; + } + } + } + + #check for tails with min trimtails char repeats + if($good && exists $params{trim_tail_left}) { + $length = length($seqn); + my $begintmp = 0; + if($seqn =~ $repAleft || $seqn =~ $repTleft) { + my @tmp = split('',$seqn); + my $tmpchar = $tmp[0]; #A or T + $begintmp += $params{trim_tail_left}; + foreach ($params{trim_tail_left}..$length-1) { + last unless($tmp[$_] eq $tmpchar || $tmp[$_] eq 'N'); + $begintmp++; + } + if($begintmp >= $length) { + $good = 0; + $filtercount{trim_tail_left}++; + } else { + $seqn = substr($seqn,$begintmp); + $qualn = substr($qualn,$begintmp) if(defined $qualn && length($qualn)); + $length = length($seqn); + $begin += $begintmp; + } + } + } + if($good && exists 
$params{trim_tail_right}) { + $length = length($seqn); + my $endtmp = 0; + if($seqn =~ $repAright || $seqn =~ $repTright) { + my @tmp = split('',$seqn); + my $tmpchar = $tmp[$length-1]; #A or T + $endtmp += $params{trim_tail_right}; + foreach (reverse 0..$length-$params{trim_tail_right}-1) { + last unless($tmp[$_] eq $tmpchar || $tmp[$_] eq 'N'); + $endtmp++; + } + if($endtmp >= $length) { + $good = 0; + $filtercount{trim_tail_right}++; + } else { + $seqn = substr($seqn,0,$length-$endtmp); + $qualn = substr($qualn,0,$length-$endtmp) if(defined $qualn && length($qualn)); + $end += $endtmp; + } + } + } + if($good && exists $params{trim_ns_left}) { + $length = length($seqn); + my $begintmp = 0; + if($seqn =~ $repNleft) { + my @tmp = split('',$seqn); + $begintmp += $params{trim_ns_left}; + foreach ($params{trim_ns_left}..$length-1) { + last unless($tmp[$_] eq 'N'); + $begintmp++; + } + if($begintmp >= $length) { + $good = 0; + $filtercount{trim_ns_left}++; + } else { + $seqn = substr($seqn,$begintmp); + $qualn = substr($qualn,$begintmp) if(defined $qualn && length($qualn)); + $length = length($seqn); + $begin += $begintmp; + } + } + } + if($good && exists $params{trim_ns_right}) { + $length = length($seqn); + my $endtmp = 0; + if($seqn =~ $repNright) { + my @tmp = split('',$seqn); + $endtmp += $params{trim_ns_right}; + foreach (reverse 0..$length-$params{trim_ns_right}-1) { + last unless($tmp[$_] eq 'N'); + $endtmp++; + } + if($endtmp >= $length) { + $good = 0; + $filtercount{trim_ns_right}++; + } else { + $seqn = substr($seqn,0,$length-$endtmp); + $qualn = substr($qualn,0,$length-$endtmp) if(defined $qualn && length($qualn)); + $end += $endtmp; + } + } + } + + #check if trim to certain length + $length = length($seqn); + if($good && exists $params{trim_to_len} && $length > $params{trim_to_len}) { + $seqn = substr($seqn,0,$params{trim_to_len}); + $qualn = substr($qualn,0,$params{trim_to_len}) if(defined $qualn && length($qualn)); + $end += 
($length-$params{trim_to_len}); + } + + #check for sequence length + $length = length($seqn); + $bylength = ($length ? 100/$length : 0); + if($bylength == 0) { + $good = 0; + $filtercount{zero_length}++; + } + if($good && exists $params{min_len} && $length < $params{min_len}) { + $good = 0; + $filtercount{min_len}++; + } + if($good && exists $params{max_len} && $length > $params{max_len}) { + $good = 0; + $filtercount{max_len}++; + } + if($good && exists $params{range_len} && !&checkRange($params{range_len},$length)) { + $good = 0; + $filtercount{range_len}++; + } + + #check for quality scores + if($good && defined $qualn && (exists $params{min_qual_score} || exists $params{max_qual_score} || exists $params{min_qual_mean} || exists $params{max_qual_mean})) { + my ($err); + if($qualsnums) { + if($begin > 0) { + shift(@$qualsnums) foreach(1..$begin); + } + if($end > 0) { + pop(@$qualsnums) foreach(1..$end); + } + } else { + if(exists $params{phred64}) { #scale data to Phred scale if necessary + ($qualsnums,$err) = &convertQualAsciiToNumsPhred64($qualn); + if($err) { + &printError("The sequence quality scores are not in Phred+64 format"); + } + } else { + $qualsnums = &convertQualAsciiToNums($qualn); + } + } + if($good && exists $params{min_qual_score} && min(@$qualsnums) < $params{min_qual_score}) { + $good = 0; + $filtercount{min_qual_score}++; + } + if($good && exists $params{max_qual_score} && max(@$qualsnums) > $params{max_qual_score}) { + $good = 0; + $filtercount{max_qual_score}++; + } + if($good && exists $params{min_qual_mean} && &getArrayMean(@$qualsnums) < $params{min_qual_mean}) { + $good = 0; + $filtercount{min_qual_mean}++; + } + if($good && exists $params{max_qual_mean} && &getArrayMean(@$qualsnums) > $params{max_qual_mean}) { + $good = 0; + $filtercount{max_qual_mean}++; + } + } + + #check for GC content + if($good && (exists $params{min_gc} || exists $params{max_gc} || exists $params{range_gc})) { + my $gc = ($seqn =~ tr/GC//); + $gc = 
sprintf("%d",$gc*$bylength); + if(exists $params{min_gc} && $gc < $params{min_gc}) { + $good = 0; + $filtercount{min_gc}++; + } + if($good && exists $params{max_gc} && $gc > $params{max_gc}) { + $good = 0; + $filtercount{max_gc}++; + } + if($good && exists $params{range_gc} && !&checkRange($params{range_gc},$gc)) { + $good = 0; + $filtercount{range_gc}++; + } + } + + #check for N's in sequence + if($good && (exists $params{ns_max_p} || exists $params{ns_max_n})) { + my $ns = ($seqn =~ tr/N//); + if(exists $params{ns_max_p} && ($ns*$bylength) > $params{ns_max_p}) { + $good = 0; + $filtercount{ns_max_p}++; + } + if($good && exists $params{ns_max_n} && $ns > $params{ns_max_n}) { + $good = 0; + $filtercount{ns_max_n}++; + } + } + + #check for non IUPAC chars in sequence + if($good && exists $params{noniupac} && $seqn =~ /[^ACGTN]/o) { + $good = 0; + $filtercount{noniupac}++; + } + + #check for additional filter parameters + if($good && @cps) { + foreach my $p (@cps) { + if($p->[0]) { #repeats + if(index($seqn,$p->[1]x$p->[2]) != -1) { + $good = 0; + $filtercount{custom_params}++; + last; + } + } else { #percentage + my $ns = 0; + my $v = $p->[1]; + $ns++ while($seqn =~ /$v/g); + if((100*$ns/$length) > $p->[2]) { + $good = 0; + $filtercount{custom_params}++; + last; + } + } + } + } + + #check for sequence complexity + if($good && defined $complval) { + my ($rest,$steps,@vals,$str,$num,$bynum); + if($length <= $WINDOWSIZE) { + $rest = $length; + $steps = 0; + } else { + $steps = int(($length - $WINDOWSIZE) / $WINDOWSTEP) + 1; + $rest = $length - $steps * $WINDOWSTEP; + unless($rest > $WINDOWSTEP) { + $rest += $WINDOWSTEP; + $steps--; + } + } + $num = $WINDOWSIZE-2; + $bynum = 1/$num; + $num--; + my $mean = 0; + if($params{lc_method} eq 'dust') { + my $dustscore; + foreach my $i (0..$steps-1) { + $str = substr($seqn,($i * $WINDOWSTEP),$WINDOWSIZE); + %counts = (); + foreach my $i (@WINDOWSIZEARRAY) { + $counts{substr($str,$i,3)}++; + } + $dustscore = 0; + foreach(values 
%counts) { + $dustscore += ($_ * ($_ - 1) * $POINTFIVE); + } + push(@vals,($dustscore * $bynum)); + } + #last step + if($rest > 5) { + $str = substr($seqn,($steps * $WINDOWSTEP),$rest); + %counts = (); + $num = $rest-2; + foreach my $i (0..($num - 1)) { + $counts{substr($str,$i,3)}++; + } + $dustscore = 0; + foreach(values %counts) { + $dustscore += ($_ * ($_ - 1) * $POINTFIVE); + } + push(@vals,(($dustscore / ($num-1)) * (($WINDOWSIZE - 2) / $num))); + } else { + push(@vals,31); #to assign a maximum score based on the scaling factor 100/31 + } + $mean = &getArrayMean(@vals); + if(int($mean * 100 / 31) > $complval) { + $good = 0; + $filtercount{lc_method}++; + } + } elsif($params{lc_method} eq 'entropy') { + my $entropyval; + foreach my $i (0..$steps-1) { + $str = substr($seqn,($i * $WINDOWSTEP),$WINDOWSIZE); + %counts = (); + foreach my $i (@WINDOWSIZEARRAY) { + $counts{substr($str,$i,3)}++; + } + $entropyval = 0; + foreach(values %counts) { + $entropyval -= ($_ * $bynum) * log($_ * $bynum); + } + push(@vals,($entropyval * $ONEOVERLOG62)); + } + #last step + if($rest > 5) { + $str = substr($seqn,($steps * $WINDOWSTEP),$rest); + %counts = (); + $num = $rest-2; + foreach my $i (0..($num - 1)) { + $counts{substr($str,$i,3)}++; + } + $entropyval = 0; + $bynum = 1/$num; + foreach(values %counts) { + $entropyval -= ($_ * $bynum) * log($_ * $bynum); + } + push(@vals,($entropyval / log($num))); + } else { + push(@vals,0); + } + $mean = &getArrayMean(@vals); + if(int($mean * 100) < $complval) { + $good = 0; + $filtercount{lc_method}++; + } + } + } + + #stop here for paired-end reads + if($file2) { + return ($seqn,$good,$begin,$end); + } + + #check for read duplicates + if($good && $derep) { + if($exactonly) { + $md5 = md5_hex($seqn); + if(exists $dereptypes{0}) { + if(exists $md5s{$md5}) { #forward duplicate + $md5s{$md5}->{0}++; + if($derepmin <= $md5s{$md5}->{0}+1) { + $good = 0; + $filtercount{derep}++; + } + } + } + if($good && exists $dereptypes{3}) { + $md5r = 
md5_hex(&revcompuc($seqn)); + if(exists $md5s{$md5r}) { #reverse duplicate + $md5s{$md5}->{3}++; + if($derepmin <= $md5s{$md5}->{3}+1) { + $good = 0; + $filtercount{derep}++; + } + } + } + unless(exists $md5s{$md5}) { + $md5s{$md5} = {0 => 0, 3 => 0}; + } + } else { + push(@seqsP,[$seqn,$goodcount++,$length]); + #keep write data for possible duplicates + if($params{out_format} == 1) { #FASTA + push(@printtmp,[$sid,$header,$seq,$begin,$end,'']); + } else { # FASTQ or FASTA+QUAL or FASTQ+FASTA or FASTQ+FASTA+QUAL + push(@printtmp,[$sid,$header,$seq,$begin,$end,$qual]); + } + } + } + + if($good && (($derep && $exactonly) || !$derep)) { #passed filters + $seqcount++; + $seqbases += $length; + return if($nogood); + #check if change of sequence ID + if(exists $params{seq_id}) { + if($mappings) { + print $fhmappings join("\t",$sid,$params{seq_id}.$seqcount)."\n"; + } + $sid = $params{seq_id}.$seqcount; + } + if(exists $params{rm_header}) { + $header = undef; + } + #trim if necessary + if($begin) { + $seq = substr($seq,$begin); + $qual = substr($qual,$begin) if(defined $qual && length($qual)); + } + if($end) { + $length = length($seq); + $seq = substr($seq,0,$length-$end); + $qual = substr($qual,0,$length-$end) if(defined $qual && length($qual)); + } + #change case + if(exists $params{seq_case}) { + if($params{seq_case} eq 'lower') { #lower case + $seq = lc($seq); + } elsif($params{seq_case} eq 'upper') { #upper case + $seq = uc($seq); + } + } + #convert between DNA and RNA + if(exists $params{dna_rna}) { + if($params{dna_rna} eq 'dna') { #RNA to DNA + $seq =~ tr/Uu/Tt/; + } elsif($params{dna_rna} eq 'rna') { #DNA to RNA + $seq =~ tr/Tt/Uu/; + } + } + #write data + if($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5) { # FASTQ + &printError("missing quality data for sequence \"$sid\" or greater number of sequences than available quality scores") unless(defined $qual); + if($stdoutgood) { + print STDOUT '@'.$sid.($header ? 
' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $sid.($header ? ' '.$header : ''))."\n"; + print STDOUT $qual."\n"; + } else { + print $fhgood '@'.$sid.($header ? ' '.$header : '')."\n"; + print $fhgood $seq."\n"; + print $fhgood '+'.(exists $params{no_qual_header} ? '' : $sid.($header ? ' '.$header : ''))."\n"; + print $fhgood $qual."\n"; + } + } + if($params{out_format} == 1 || $params{out_format} == 2 || $params{out_format} == 4 || $params{out_format} == 5) { #FASTA + #set line length + if($linelen) { + $seq =~ s/(.{$linelen})/$1\n/g; + $seq =~ s/\n$//; + } + if($stdoutgood) { + print STDOUT '>'.$sid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + } elsif($params{out_format} == 1 || $params{out_format} == 2) { + print $fhgood '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhgood $seq."\n"; + } else { + print $fhgood3 '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhgood3 $seq."\n"; + } + } + if($params{out_format} == 2 || $params{out_format} == 5) { #QUAL + &printError("missing quality data for sequence \"$sid\" or greater number of sequences than available quality scores") unless(defined $qual); + print $fhgood2 '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhgood2 &convertQualArrayToString(&convertQualAsciiToNums($qual),$linelen)."\n"; + } + } elsif($good && $derep && !$exactonly) { + #do nothing as sequences will be used for duplicate check + } else { #filtered out + $badcount++; + $badbases += length($seq); + return if($nobad); + #write data + if($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5) { # FASTQ + &printError("missing quality data for sequence \"$sid\" or greater number of sequences than available quality scores") unless(defined $qual); + if($stdoutbad) { + print STDOUT '@'.$sid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $sid.($header ? 
' '.$header : ''))."\n"; + print STDOUT $qual."\n"; + } else { + print $fhbad '@'.$sid.($header ? ' '.$header : '')."\n"; + print $fhbad $seq."\n"; + print $fhbad '+'.(exists $params{no_qual_header} ? '' : $sid.($header ? ' '.$header : ''))."\n"; + print $fhbad $qual."\n"; + } + } + if($params{out_format} == 1 || $params{out_format} == 2 || $params{out_format} == 4 || $params{out_format} == 5) { #FASTA + #set line length + if($linelen) { + $seq =~ s/(.{$linelen})/$1\n/g; + $seq =~ s/\n$//; + } + if($stdoutbad) { + print STDOUT '>'.$sid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + } elsif($params{out_format} == 1 || $params{out_format} == 2) { + print $fhbad '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhbad $seq."\n"; + } else { + print $fhbad3 '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhbad3 $seq."\n"; + } + } + if($params{out_format} == 2 || $params{out_format} == 5) { #QUAL + &printError("missing quality data for sequence \"$sid\" or greater number of sequences than available quality scores") unless(defined $qual); + print $fhbad2 '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhbad2 &convertQualArrayToString(&convertQualAsciiToNums($qual),$linelen)."\n"; + } + } +} + +#dereplicate sequences +sub derepSeqs { + my $numseqs = scalar(@seqsP); + if($derep && $numseqs) { + my ($sid,$seq,$qual,$header,$begin,$end); + my ($dcounts,undef,$dupls) = &checkForDupl(\@seqsP,\%dereptypes,$numseqs); + + print STDERR "Write results to output file(s)\n" if(exists $params{verbose}); + #for progress bar + my $progress = 0; + my $counter = 1; + my $part = int($numseqs/100); + print STDERR "\r\tstatus: ".int($progress)." \%" if(exists $params{verbose}); + + foreach my $i (0..$numseqs-1) { + #progress bar stuff + $counter++; + if($counter > $part) { + $counter = 1; + $progress++; + $progress = 99 if($progress > 99); + print STDERR "\r\tstatus: ".int($progress)." 
\%" if(exists $params{verbose}); + } + #get data + $sid = $printtmp[$i]->[0]; + $seq = $printtmp[$i]->[2]; + $qual = $printtmp[$i]->[5]; + $header = $printtmp[$i]->[1]; + $begin = $printtmp[$i]->[3]; + $end = $printtmp[$i]->[4]; + #write data + if(exists $dupls->{$i} || (exists $params{seq_num} && $params{seq_num} <= $seqcount)) { #bad + $filtercount{derep}++; + $badcount++; + $badbases += length($seq); + next if($nobad); + #write data + if($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5) { # FASTQ + &printError("missing quality data for sequence \"$sid\" or greater number of sequences than available quality scores") unless(defined $qual); + if($stdoutbad) { + print STDOUT '@'.$sid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $sid.($header ? ' '.$header : ''))."\n"; + print STDOUT $qual."\n"; + } else { + print $fhbad '@'.$sid.($header ? ' '.$header : '')."\n"; + print $fhbad $seq."\n"; + print $fhbad '+'.(exists $params{no_qual_header} ? '' : $sid.($header ? ' '.$header : ''))."\n"; + print $fhbad $qual."\n"; + } + } + if($params{out_format} == 1 || $params{out_format} == 2 || $params{out_format} == 4 || $params{out_format} == 5) { #FASTA + #set line length + if($linelen) { + $seq =~ s/(.{$linelen})/$1\n/g; + $seq =~ s/\n$//; + } + if($stdoutbad) { + print STDOUT '>'.$sid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + } elsif($params{out_format} == 1 || $params{out_format} == 2) { + print $fhbad '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhbad $seq."\n"; + } else { + print $fhbad3 '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhbad3 $seq."\n"; + } + } + if($params{out_format} == 2 || $params{out_format} == 5) { #QUAL + &printError("missing quality data for sequence \"$sid\" or greater number of sequences than available quality scores") unless(defined $qual); + print $fhbad2 '>'.$sid.($header ? 
' '.$header : '')."\n"; + print $fhbad2 &convertQualArrayToString(&convertQualAsciiToNums($qual),$linelen)."\n"; + } + } else { #good + $seqcount++; + $seqbases += (length($seq)-$begin-$end); + next if($nogood); + #check if change of sequence ID + if(exists $params{seq_id}) { + if($mappings) { + print $fhmappings join("\t",$sid,$params{seq_id}.$seqcount)."\n"; + } + $sid = $params{seq_id}.$seqcount; + } + if(exists $params{rm_header}) { + $header = undef; + } + #trim if necessary + if($begin) { + $seq = substr($seq,$begin); + $qual = substr($qual,$begin) if(defined $qual && length($qual)); + } + if($end) { + $length = length($seq); + $seq = substr($seq,0,$length-$end); + $qual = substr($qual,0,$length-$end) if(defined $qual && length($qual)); + } + #change case + if(exists $params{seq_case}) { + if($params{seq_case} eq 'lower') { #lower case + $seq = lc($seq); + } elsif($params{seq_case} eq 'upper') { #upper case + $seq = uc($seq); + } + } + #convert between DNA and RNA + if(exists $params{dna_rna}) { + if($params{dna_rna} eq 'dna') { #RNA to DNA + $seq =~ tr/Uu/Tt/; + } elsif($params{dna_rna} eq 'rna') { #DNA to RNA + $seq =~ tr/Tt/Uu/; + } + } + #write data + if($params{out_format} == 3 || $params{out_format} == 4 || $params{out_format} == 5) { # FASTQ + &printError("missing quality data for sequence \"$sid\" or greater number of sequences than available quality scores") unless(defined $qual); + if($stdoutgood) { + print STDOUT '@'.$sid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + print STDOUT '+'.(exists $params{no_qual_header} ? '' : $sid.($header ? ' '.$header : ''))."\n"; + print STDOUT $qual."\n"; + } else { + print $fhgood '@'.$sid.($header ? ' '.$header : '')."\n"; + print $fhgood $seq."\n"; + print $fhgood '+'.(exists $params{no_qual_header} ? '' : $sid.($header ? 
' '.$header : ''))."\n"; + print $fhgood $qual."\n"; + } + } + if($params{out_format} == 1 || $params{out_format} == 2 || $params{out_format} == 4 || $params{out_format} == 5) { #FASTA + #set line length + if($linelen) { + $seq =~ s/(.{$linelen})/$1\n/g; + $seq =~ s/\n$//; + } + if($stdoutgood) { + print STDOUT '>'.$sid.($header ? ' '.$header : '')."\n"; + print STDOUT $seq."\n"; + } elsif($params{out_format} == 1 || $params{out_format} == 2) { + print $fhgood '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhgood $seq."\n"; + } else { + print $fhgood3 '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhgood3 $seq."\n"; + } + } + if($params{out_format} == 2 || $params{out_format} == 5) { #QUAL + &printError("missing quality data for sequence \"$sid\" or greater number of sequences than available quality scores") unless(defined $qual); + print $fhgood2 '>'.$sid.($header ? ' '.$header : '')."\n"; + print $fhgood2 &convertQualArrayToString(&convertQualAsciiToNums($qual),$linelen)."\n"; + } + } + } + print STDERR "\r\tdone \n" if(exists $params{verbose}); + } +} + +#calculate summary statistics from sequences +sub calcSeqStats { + my ($seq,$length,$stats,$kmers,$odds,$counts,$pair) = @_; + + #length related: min, max, range, mean, stddev, mode + if(exists $params{stats_len} || exists $params{stats_assembly}) { + $counts->{length}->{$length}++; + } + + #dinucleotide odds ratio related: aatt, acgt, agct, at, catg, ccgg, cg, gatc, gc, ta + if(exists $params{stats_dinuc}) { + &dinucOdds($seq,$length,$odds); + } + + #tag related: probability of 5' and 3' tag sequence based on kmer counts + if(exists $params{stats_tag}) { + #get kmers + if($length >= 5) { + #get 5' and 3' ends + my $str5 = substr($seq,0,5); + my $str3 = substr($seq,$length-5); + unless($str5 eq 'AAAAA' || $str5 eq 'TTTTT' || $str5 eq 'CCCCC' || $str5 eq 'GGGGG' || $str5 eq 'NNNNN') { + $kmers->{5}->{$str5}++; + } + unless($str3 eq 'AAAAA' || $str3 eq 'TTTTT' || $str3 eq 'CCCCC' || $str3 eq 'GGGGG' || 
$str3 eq 'NNNNN') { + $kmers->{3}->{$str3}++; + } + } + #check for MID tags + if($length >= $MIDCHECKLENGTH) { + my $str5 = substr($seq,0,$MIDCHECKLENGTH); + foreach my $mid (keys %MIDS) { + if(index($str5,$mid) != -1) { + $MIDS{$mid}++; + last; + } + } + } + } + + #ambiguous base N related: seqswithn, maxp + if(exists $params{stats_ns}) { + my $bylength = 100/$length; + my $ns = ($seq =~ tr/N//); + if($pair) { + $stats->{stats_ns2}->{seqswithn}++ if($ns > 0); + $stats->{stats_ns2}->{maxn} = $ns if($ns > ($stats->{stats_ns2}->{maxn}||0)); + $ns = ($ns > 0 && $ns*$bylength < 1 ? 1 : sprintf("%d",$ns*$bylength)); + $stats->{stats_ns2}->{maxp} = $ns if($ns > ($stats->{stats_ns2}->{maxp}||0)); + } else { + $stats->{stats_ns}->{seqswithn}++ if($ns > 0); + $stats->{stats_ns}->{maxn} = $ns if($ns > ($stats->{stats_ns}->{maxn}||0)); + $ns = ($ns > 0 && $ns*$bylength < 1 ? 1 : sprintf("%d",$ns*$bylength)); + $stats->{stats_ns}->{maxp} = $ns if($ns > ($stats->{stats_ns}->{maxp}||0)); + } + } +} + +#dinucleotide odds ratio calculation +sub dinucOdds { + my ($seq,$length,$odds) = @_; + my ($mononum,$dinum,$i,$x,$y); + my %di = %DN_DI; + my (%mono,$lengthtmp); + my @tmp = split(/N+/,$seq); + foreach(@tmp) { + $lengthtmp = length($_)-1; + next unless($lengthtmp > 0); + $mono{AT} += ($_ =~ tr/AT//); + $mono{GC} += ($_ =~ tr/GC//); + $i = 0; + while($i < $lengthtmp) { + $di{substr($_,$i++,2)}++; + } + } + $dinum = sum(values %di); + + if($dinum) { + $mononum = sum(values %mono); + my $factor = 2 * $mononum * $mononum / $dinum; + my $AT = $mono{AT}; + my $GC = $mono{GC}; + if($AT) { + my $AT2 = $factor / ($AT * $AT); + $odds->{'AATT'} += ($di{'AA'} + $di{'TT'}) * $AT2; + $odds->{'AT'} += 2 * $di{'AT'} * $AT2; + $odds->{'TA'} += 2 * $di{'TA'} * $AT2; + if($GC) { + my $ATGC = $factor / ($AT * $GC); + $odds->{'ACGT'} += ($di{'AC'} + $di{'GT'}) * $ATGC; + $odds->{'AGCT'} += ($di{'AG'} + $di{'CT'}) * $ATGC; + $odds->{'CATG'} += ($di{'CA'} + $di{'TG'}) * $ATGC; + $odds->{'GATC'} += 
($di{'GA'} + $di{'TC'}) * $ATGC; + my $GC2 = $factor / ($GC * $GC); + $odds->{'CCGG'} += ($di{'CC'} + $di{'GG'}) * $GC2; + $odds->{'CG'} += 2 * $di{'CG'} * $GC2; + $odds->{'GC'} += 2 * $di{'GC'} * $GC2; + } + } elsif($GC) { + my $GC2 = $factor / ($GC * $GC); + $odds->{'CCGG'} += ($di{'CC'} + $di{'GG'}) * $GC2; + $odds->{'CG'} += 2 * $di{'CG'} * $GC2; + $odds->{'GC'} += 2 * $di{'GC'} * $GC2; + } + } +} + +#calculate basic stats from a hash of number->count values +sub generateStats { + my $counts = shift; + my ($min,$max,$modeval,$mode,$mean,$count,$std,$x,$c,@vals,$num,$median,%stats); + + #min, max, mode and modeval + $min = -1; + $max = $modeval = $mean = $count = $std = $num = 0; + while (($x, $c) = each(%$counts)) { + if($min == -1) { + $min = $x; + } elsif($min > $x) { + $min = $x; + } + if($max < $x) { + $max = $x; + } + if($modeval < $c) { + $modeval = $c; + $mode = $x; + } + $mean += $x*$c; + $count += $c; + foreach(1..$c) { + push(@vals,$x); + $num++; + } + } + + #mean and stddev + $mean /= $count; + while (($x, $c) = each(%$counts)) { + $std += $c*(($x-$mean)**2); + } + + #median + if($num == 1) { + $median = $vals[0]; + } elsif($num == 2) { + $median = ($vals[0]+$vals[1])/2; + } else { + @vals = sort {$a <=> $b} @vals; + if($num % 2) { + $median = $vals[($num-1)/2]; + } else { + $median = ($vals[$num/2]+$vals[$num/2-1])/2; + } + } + + #save stats + $stats{min} = $min; + $stats{max} = $max; + $stats{range} = $max-$min+1; + $stats{modeval} = $modeval; + $stats{mode} = $mode; + $stats{mean} = sprintf("%.2f",$mean); + $stats{stddev} = sprintf("%.2f",($std/$count)**(1/2)); + $stats{median} = $median; + + return \%stats; +} + +sub generateStatsType { + my $counts = shift; + my (%stats,$min,$max,$modeval,$mode,$mean,$std,$x,$c,@vals,$num,$median,$p25,$p75,$numq,$i,$j,$median1,$median2,$p251,$p252,$p751,$p752); + + foreach my $kind (keys %$counts) { + @vals = (); + $min = -1; + $max = $modeval = $mean = $std = $num = 0; + foreach my $x1 (sort {$a <=> $b} keys 
%{$counts->{$kind}}) { + $c = $counts->{$kind}->{$x1}; + if($min == -1) { + $min = $x1; + } + if($max < $x1) { + $max = $x1; + } + if($modeval < $c) { + $modeval = $c; + $mode = $x1; + } + $mean += $x1*$c; + $num += $c; + push(@vals,[$c,$x1]); #count, values + } + + $mean /= $num; + while (($x, $c) = each(%{$counts->{$kind}})) { + $std += $c*(($x-$mean)**2); + } + + if($num == 1) { + $median = $p25 = $p75 = $vals[0]->[1]; + } elsif($num == 2) { + if($vals[0]->[0] == 1) { #two different numbers + $p25 = $vals[0]->[1]; + $p75 = $vals[1]->[1]; + $median = ($vals[0]->[1]+$vals[1]->[1])/2; + } else { + $p25 = $p75 = $median = $vals[0]->[1]; #both same + } + } elsif($num > 2) { + if($num % 2) { + $i = 0; + $j = 0; + while($i <= ($num-1)/2) { + $median = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + $numq = ($num+1)/2; + } else { + $i = 0; + $j = 0; + while($i <= ($num/2-1)) { + $median1 = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + $median2 = $median1; + while($i <= ($num/2)) { + $median2 = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + $median = ($median1 + $median2)/2; + $numq = $num/2; + } + if($numq % 2) { + $i = 0; + $j = 0; + while($i <= (($numq-1)/2)) { + $p25 = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + $p75 = $p25; + while($i <= ($num-($numq-1)/2-1)) { + $p75 = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + } else { + $i = 0; + $j = 0; + while($i <= ($numq/2-1)) { + $p251 = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + $p252 = $p251; + while($i <= ($numq/2)) { + $p252 = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + $p751 = $p252; + while($i <= ($num-$numq/2-1)) { + $p751 = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + $p752 = $p751; + while($i <= ($num-$numq/2)) { + $p752 = $vals[$j]->[1]; + $i += $vals[$j]->[0]; + $j++; + } + $p25 = ($p251 + $p252) / 2; + $p75 = ($p751 + $p752) / 2; + } + } else { + $median = $p25 = $p75 = 0; + } + + $stats{$kind}->{min} = $min; + $stats{$kind}->{max} = $max; + 
$stats{$kind}->{range} = $max-$min+1; + $stats{$kind}->{modeval} = $modeval; + $stats{$kind}->{mode} = $mode; + $stats{$kind}->{mean} = sprintf("%.2f",$mean); + $stats{$kind}->{std} = sprintf("%.2f",($std/$num)**(1/2)); + $stats{$kind}->{median} = int($median); + $stats{$kind}->{p25} = int($p25); + $stats{$kind}->{p75} = int($p75); + } + + return \%stats; +} + +#requires seqs array with [upper-case seq, array index, length] for each entry +sub checkForDupl { + #requires seqs array with [upper-case seq, array index, length] for each entry + my ($seqs,$types,$numseqs) = @_; + my (@sort,$num,%dupls,$pretype,$precount,%counts,%lens); + #precount = number duplicates for the same sequence + + print STDERR "Check for duplicates\n" if(exists $params{verbose}); + + #for progress bar + my $progress = 1; + my $counter = 1; + my $part = int($numseqs*4/100); + print STDERR "\r\tstatus: ".int($progress)." \%" if(exists $params{verbose}); + &printWeb("STATUS: duplicate-status $progress"); + + #exact duplicates and prefix duplicates + if(exists $types->{0} || exists $types->{1} || exists $types->{2}) { + $precount = 0; + $pretype = -1; + @sort = sort {$a->[0] cmp $b->[0]} @$seqs; + foreach my $i (0..$numseqs-2) { + if(exists $types->{0} && $sort[$i]->[2] == $sort[$i+1]->[2] && $sort[$i]->[0] eq $sort[$i+1]->[0]) { + $dupls{$sort[$i]->[1]} = 0; + $lens{$sort[$i]->[2]}->{0}++; + if($pretype == 0) { + $precount++; + } else { + if($pretype == 1 && $precount) { + $counts{$precount}->{$pretype}++; + } + $pretype = 0; + $precount = 1; + } + } elsif(exists $types->{1} && $sort[$i]->[2] < $sort[$i+1]->[2] && $sort[$i]->[0] eq substr($sort[$i+1]->[0],0,$sort[$i]->[2])) { + $dupls{$sort[$i]->[1]} = 1; + $lens{$sort[$i]->[2]}->{1}++; + if($pretype == 1) { + $precount++; + } else { + if($pretype == 0 && $precount) { + $counts{$precount}->{$pretype}++; + } + $pretype = 1; + $precount = 1; + } + } else { + if($precount) { + $counts{$precount}->{$pretype}++; + $precount = 0; + } + $pretype = -1; 
+ } + $sort[$i] = undef; + #progress bar stuff + $counter++; + if($counter > $part) { + $counter = 1; + $progress++; + $progress = 99 if($progress > 99); + print STDERR "\r\tstatus: ".int($progress)." \%" if(exists $params{verbose}); + &printWeb("STATUS: duplicate-status $progress"); + } + } + if($precount) { + $counts{$precount}->{$pretype}++; + } + } + #suffix duplicates + if(exists $types->{2}) { + $num = 0; + @sort = (); + foreach(@$seqs) { + next if(exists $dupls{$_->[1]}); + push(@sort,[(scalar reverse $_->[0]),$_->[1],$_->[2]]); + $num++; + } + if($num > 1) { + $precount = 0; + $pretype = -1; + @sort = sort {$a->[0] cmp $b->[0]} @sort; + foreach my $i (0..$num-2) { + if($sort[$i]->[2] < $sort[$i+1]->[2] && $sort[$i]->[0] eq substr($sort[$i+1]->[0],0,$sort[$i]->[2])) { + $dupls{$sort[$i]->[1]} = 2; + $lens{$sort[$i]->[2]}->{2}++; + if($pretype == 2) { + $precount++; + } else { + $pretype = 2; + $precount = 1; + } + } else { + if($precount) { + $counts{$precount}->{$pretype}++; + $precount = 0; + } + $pretype = -1; + } + $sort[$i] = undef; + #progress bar stuff + $counter++; + if($counter > $part) { + $counter = 1; + $progress++; + $progress = 99 if($progress > 99); + print STDERR "\r\tstatus: ".int($progress)." 
\%" if(exists $params{verbose}); + &printWeb("STATUS: duplicate-status $progress"); + } + } + if($precount) { + $counts{$precount}->{$pretype}++; + } + } + } + #reverse complement exact and prefix/suffix duplicates + if(exists $types->{3} || exists $types->{4}) { + $num = 0; + @sort = (); + foreach(@$seqs) { + if(exists $dupls{$_->[1]}) { + $counter++; + next; + } + push(@sort,[$_->[0],$_->[1],$_->[2],0]); + push(@sort,[&revcompuc($_->[0]),$_->[1],$_->[2],1]); + $num += 2; + } + if($num > 1) { + $precount = 0; + $pretype = -1; + @sort = sort {$a->[0] cmp $b->[0]} @sort; + foreach my $i (0..$num-2) { + unless($sort[$i]->[3] == $sort[$i+1]->[3] || $sort[$i]->[1] eq $sort[$i+1]->[1] || exists $dupls{$sort[$i]->[1]}) { #don't check if both same (original or revcomp) or already counted as dubs + if(exists $types->{3} && $sort[$i]->[2] == $sort[$i+1]->[2] && $sort[$i]->[0] eq $sort[$i+1]->[0]) { + $dupls{$sort[$i]->[1]} = 3; + $lens{$sort[$i]->[2]}->{3}++; + if($pretype == 3) { + $precount++; + } else { + if($pretype == 4 && $precount) { + $counts{$precount}->{$pretype}++; + } + $pretype = 3; + $precount = 1; + } + } elsif(exists $types->{4} && $sort[$i]->[2] < $sort[$i+1]->[2] && $sort[$i]->[0] eq substr($sort[$i+1]->[0],0,$sort[$i]->[2])) { + $dupls{$sort[$i]->[1]} = 4; + $lens{$sort[$i]->[2]}->{4}++; + if($pretype == 4) { + $precount++; + } else { + if($pretype == 3 && $precount) { + $counts{$precount}->{$pretype}++; + } + $pretype = 4; + $precount = 1; + } + } else { + if($precount) { + $counts{$precount}->{$pretype}++; + $precount = 0; + } + $pretype = -1; + } + } + $sort[$i] = undef; + #progress bar stuff + $counter++; + if($counter > $part) { + $counter = 1; + $progress++; + $progress = 99 if($progress > 99); + print STDERR "\r\tstatus: ".int($progress)." 
\%" if(exists $params{verbose}); + &printWeb("STATUS: duplicate-status $progress"); + } + } + if($precount) { + $counts{$precount}->{$pretype}++; + } + } + } + print STDERR "\r\tdone \n" if(exists $params{verbose}); + &printWeb("STATUS: duplicate-status 100"); + return (\%counts,\%lens,\%dupls); +} + +#get the frequency of possible tags by shifting kmers by max 2 positions when aligned +sub getTagFrequency { + my ($kmers) = @_; + + #find most abundant kmer counts + my $percentone = $numseqs/100; + my $percentten = $numseqs/10; + my %most; + foreach my $sp (keys %$kmers) { + $most{$sp}->{max} = 0; + foreach(keys %{$kmers->{$sp}}) { +# next if($_ eq 'A'x5 || $_ eq 'T'x5 || $_ eq 'C'x5 || $_ eq 'G'x5 || $_ eq 'N'x5); + if($kmers->{$sp}->{$_} >= $percentten) { + $most{$sp}->{ten}++; + } elsif($kmers->{$sp}->{$_} >= $percentone) { + $most{$sp}->{one}++; + } + #get max count + $most{$sp}->{max} = $kmers->{$sp}->{$_} if($most{$sp}->{max} < $kmers->{$sp}->{$_}); + } + } + + #filter kmers by frequency - threshold of >10% occurrence -> max of 9 different kmers or more if there is non with >10% occurrence + my $numseqssub = $numseqs/10; + my $onecount = 2; + foreach my $sp (keys %$kmers) { + foreach(keys %{$kmers->{$sp}}) { + if(exists $most{$sp}->{ten} && $most{$sp}->{ten} > 0) { + delete $kmers->{$sp}->{$_} if($kmers->{$sp}->{$_} < $percentten); + } elsif(exists $most{$sp}->{one} && $most{$sp}->{one} > 0) { + delete $kmers->{$sp}->{$_} if($kmers->{$sp}->{$_} < $percentone); + } else { + delete $kmers->{$sp}->{$_} if($kmers->{$sp}->{$_} != $most{$sp}->{max}); + } + } + } + + my (%kmersum,%kmershift); + foreach my $sp (sort {$b <=> $a} keys %$kmers) { #5' before 3' + #if more than one kmer in array, test if shifted by max 2 positions + my $numkmer = scalar(keys %{$kmers->{$sp}}); + + if($numkmer > 1) { + my @matrix; + my @kmersort = sort {$kmers->{$sp}->{$b} <=> $kmers->{$sp}->{$a}} keys %{$kmers->{$sp}}; + foreach my $i (0..($numkmer-2)) { + foreach my $j 
(($i+1)..($numkmer-1)) { + $matrix[$i]->[$j-($i+1)] = &align2seqs($kmersort[$j],$kmersort[$i]); + } + } + my $countgood = 0; + foreach my $i (0..($numkmer-2)) { + unless(@{$matrix[0]->[$i]}) { #not matching + my $count = 0; + foreach my $j (1..($numkmer-2)) { + $count++; + last if(defined $matrix[$j]->[$i-$j]); #found shift using other kmers + } + if($count < ($numkmer-1) && $i > 0) { + my $sum = 0; + my $sign; + foreach my $j (0..$count) { + next unless(defined $matrix[$j] && defined $matrix[$j]->[$i-1]); #fix: 08/2010 + if(defined $sign) { + if(($sign < 0 && (defined $matrix[$j]->[$i-1]->[0] && $matrix[$j]->[$i-1]->[0] < 0)) || ($sign > 0 && (defined $matrix[$j]->[$i-1]->[0] && $matrix[$j]->[$i-1]->[0] > 0))) { + $sum += $matrix[$j]->[$i-1]->[0]; + } elsif(($sign < 0 && (defined $matrix[$j]->[$i-1]->[1] && $matrix[$j]->[$i-1]->[1] < 0)) || $sign > 0 && (defined $matrix[$j]->[$i-1]->[1] && $matrix[$j]->[$i-1]->[1] > 0)) { + $sum += $matrix[$j]->[$i-1]->[1]; + } + } elsif(defined $matrix[$j]->[$i-1]->[0]) { + $sum += $matrix[$j]->[$i-1]->[0]; + } + $sign = ((defined $matrix[$j]->[$i-1]->[0] && $matrix[$j]->[$i-1]->[0] < 0) ? -1 : 1); + } + $matrix[0]->[$i] = [$sum] if(defined $sign); #fix: 08/2010 + } + } + unless(@{$matrix[0]->[$i]}) { + last; + } else { + $countgood++; + } + } + if($countgood) { + my $min; + if($sp == 3) { #3' prime end, 5 for 5' end + #find maximum shift to right (pos value) + $min = -100; + foreach my $i (0..($countgood-1)) { + $min = ((defined $matrix[0]->[$i]->[0] && $min > $matrix[0]->[$i]->[0]) ? $min : $matrix[0]->[$i]->[0]); + } + if($min > 0) { + $min = -$min; + } else { + $min = 0; + } + } else { + #find maximum shift to left (neg value) + $min = 100; + foreach my $i (0..($countgood-1)) { + $min = ((defined $matrix[0]->[$i]->[0] && $min < $matrix[0]->[$i]->[0]) ? 
$min : $matrix[0]->[$i]->[0]); + } + if($min < 0) { + $min = abs($min); + } else { + $min = 0; + } + } +# $kmershift{$sp}->{$kmersort[0]} = $min; + $kmersum{$sp} += $kmers->{$sp}->{$kmersort[0]}; + foreach my $i (0..($countgood-1)) { +# $kmershift{$sp}->{$kmersort[$i+1]} = $matrix[0]->[$i]->[0]+$min; + $kmersum{$sp} += $kmers->{$sp}->{$kmersort[$i+1]}; + } + } else { + my $tmp = (sort {$kmers->{$sp}->{$b} <=> $kmers->{$sp}->{$a}} keys %{$kmers->{$sp}})[0]; +# $kmershift{$sp}->{$tmp} = 0; + $kmersum{$sp} += $kmers->{$sp}->{$tmp}; + } + } elsif($numkmer == 1) { + my $tmp = (keys %{$kmers->{$sp}})[0]; +# $kmershift{$sp}->{$tmp} = 0; + $kmersum{$sp} += $kmers->{$sp}->{$tmp}; + } + } + + return \%kmersum; +} + +sub align2seqs { + my ($seq1,$seq2) = @_; + my @shift; + + #get number of shifted positions + if(substr($seq1,0,4) eq substr($seq2,1,4)) { #shift right by 1 + push(@shift,1); + } elsif(substr($seq1,0,3) eq substr($seq2,2,3)) { #shift right by 2 + push(@shift,2); + } + if(substr($seq1,1,4) eq substr($seq2,0,4)) { #shift left by 1 + push(@shift,-1); + } elsif(substr($seq1,2,3) eq substr($seq2,0,3)) { #shift left by 2 + push(@shift,-2); + } + + return \@shift; +} + +sub revcompuc { + my $seq = shift; + $seq = scalar reverse $seq; + $seq =~ tr/GATC/CTAG/; + return $seq; +} + +sub compuc { + my $seq = shift; + $seq =~ tr/GATC/CTAG/; + return $seq; +} + +#get data for graphs +sub getSeqStats { + my ($graphdata,$seqgd,$length) = @_; + if($length > $maxlength) { + $maxlength = $length; + } + my ($gc,$ns,$begin,$end,$str5,$str3,$bylength,$tmp); + $begin = $end = $gc = $ns = 0; + #get length + $bylength = 100/$length; + #get 5' and 3' ends + if($webstats{pt} || $webstats{ts} || $graphstats{pt} || $graphstats{ts}) { + $str5 = substr($seqgd,0,5); + $str3 = substr($seqgd,$length-5); + } + #GC content + if($webstats{gc} || $graphstats{gc}) { + $gc = ($seqgd =~ tr/GC//); + $gc = sprintf("%d",$gc*$bylength); + } + + #N's + if($webstats{ns} || $graphstats{ns}) { + $ns = ($seqgd 
=~ tr/N//); + $ns = ($ns > 0 && $ns*$bylength < 1 ? 1 : sprintf("%d",$ns*$bylength)); + } + + #tail stuff with min 5 char repeats + if($webstats{pt} || $graphstats{pt}) { + #at sequence 5'-end + if($str5 eq 'AAAAA' || $str5 eq 'TTTTT') { + my $tmpchar = substr($str5,0,1); #A or T + $begin = 5; + foreach(5..$length-1) { + $tmp = substr($seqgd,$_,1); + last unless($tmp eq $tmpchar || $tmp eq 'N'); + $begin++; + } + } + #at sequence 3'-end + if($str3 eq 'AAAAA' || $str3 eq 'TTTTT') { + my $tmpchar = substr($str3,0,1); #A or T + $end = 5; + foreach (reverse 0..$length-6) { + $tmp = substr($seqgd,$_,1); + last unless($tmp eq $tmpchar || $tmp eq 'N'); + $end++; + } + } + } + + #get base frequencies + if($webstats{ts} || $graphstats{ts}) { + if($length >= $TAG_LENGTH) { + foreach my $i (0..$TAG_LENGTH-1) { + $graphdata->{freqs}->{5}->{$i}->{substr($seqgd,$i,1)}++; + $graphdata->{freqs}->{3}->{$i}->{substr($seqgd,$length-$TAG_LENGTH+$i,1)}++; + } + } + #get kmers + if($length >= 5) { + unless($begin > 0 || $str5 eq 'CCCCC' || $str5 eq 'GGGGG' || $str5 eq 'NNNNN') { + $graphdata->{kmers}->{5}->{$str5}++; + } + unless($end > 0 || $str3 eq 'CCCCC' || $str3 eq 'GGGGG' || $str3 eq 'NNNNN') { + $graphdata->{kmers}->{3}->{$str3}++; + } + } + #check for MID tags + if($length >= $MIDCHECKLENGTH) { + $str5 = substr($seqgd,0,$MIDCHECKLENGTH); + foreach my $mid (keys %MIDS) { + if(index($str5,$mid) != -1) { + $graphdata->{mids}->{$mid}++; + last; + } + } + } + } + + #calculate sequence complexity + if($webstats{sc} || $graphstats{sc}) { + my ($rest,$steps,@dust,@entropy,$mean,$str,%counts,$num,$dustscore,$entropyval,$bynum); + if($length <= $WINDOWSIZE) { + $rest = $length; + $steps = 0; + } else { + $steps = int(($length - $WINDOWSIZE) / $WINDOWSTEP) + 1; + $rest = $length - $steps * $WINDOWSTEP; + unless($rest > $WINDOWSTEP) { + $rest += $WINDOWSTEP; + $steps--; + } + } + #dust and entropy + $num = $WINDOWSIZE-2; + $bynum = 1/$num; + $num--; + foreach my $i (0..$steps-1) { + $str = 
substr($seqgd,($i * $WINDOWSTEP),$WINDOWSIZE); + %counts = (); + foreach my $i (@WINDOWSIZEARRAY) { + $counts{substr($str,$i,3)}++; + } + #dust and entropy + $dustscore = $entropyval = 0; + foreach(values %counts) { + $dustscore += ($_ * ($_ - 1) * $POINTFIVE); + $entropyval -= ($_ * $bynum) * log($_ * $bynum); + } + push(@dust,($dustscore * $bynum)); + push(@entropy,($entropyval * $ONEOVERLOG62)); + } + #last step + if($rest > 5) { + $str = substr($seqgd,($steps * $WINDOWSTEP),$rest); + %counts = (); + $num = $rest-2; + foreach my $i (0..($num - 1)) { + $counts{substr($str,$i,3)}++; + } + $dustscore = $entropyval = 0; + $bynum = 1/$num; + foreach(values %counts) { + $dustscore += ($_ * ($_ - 1) * $POINTFIVE); + $entropyval -= ($_ * $bynum) * log($_ * $bynum); + } + push(@dust,(($dustscore / ($num-1)) * (($WINDOWSIZE - 2) / $num))); + push(@entropy,($entropyval / log($num))); + } else { + push(@dust,31); #to assign a maximum score based on the scaling factor 100/31 + push(@entropy,0); + } + + $mean = &getArrayMean(@dust); + $mean = int($mean * 100 / 31); #scale to 100 + $graphdata{compldust}->{$mean}++; + if(!exists $graphdata{complvals}->{dust}->{minval} || $graphdata{complvals}->{dust}->{minval} > $mean) { + $graphdata{complvals}->{dust}->{minval} = $mean; + $graphdata{complvals}->{dust}->{minseq} = $seqgd; + } + if(!exists $graphdata{complvals}->{dust}->{maxval} || $graphdata{complvals}->{dust}->{maxval} < $mean) { + $graphdata{complvals}->{dust}->{maxval} = $mean; + $graphdata{complvals}->{dust}->{maxseq} = $seqgd; + } + $mean = &getArrayMean(@entropy); + $mean = int($mean * 100); #scale to 100 + $graphdata{complentropy}->{$mean}++; + if(!exists $graphdata{complvals}->{entropy}->{minval} || $graphdata{complvals}->{entropy}->{minval} > $mean) { + $graphdata{complvals}->{entropy}->{minval} = $mean; + $graphdata{complvals}->{entropy}->{minseq} = $seqgd; + } + if(!exists $graphdata{complvals}->{entropy}->{maxval} || $graphdata{complvals}->{entropy}->{maxval} < 
$mean) { + $graphdata{complvals}->{entropy}->{maxval} = $mean; + $graphdata{complvals}->{entropy}->{maxseq} = $seqgd; + } + } + + #calculate dinucleotide odds ratios + if($webstats{dn} || $graphstats{dn}) { + &dinucOdds($seqgd,$length,\%{$graphdata{dinucodds}}); + } + + #store counts + if($webstats{ld} || $graphstats{ld}) { + $graphdata->{counts}->{length}->{$length}++; + } + if($begin) { + $graphdata->{counts}->{tail5}->{$begin}++; + } + if($end) { + $graphdata->{counts}->{tail3}->{$end}++; + } + if($webstats{gc} || $graphstats{gc}) { + $graphdata->{counts}->{gc}->{$gc}++; + } + if($ns) { + $graphdata->{counts}->{ns}->{$ns}++; + } + + return 1; +} + +sub getQualStats { + my ($graphdata,$qual,$length) = @_; + + #check if quality values are available + return 0 unless($qual && ($webstats{qd} || $graphstats{qd})); + + #calculate decimal values of quals + my ($vals,$err); + if(exists $params{phred64}) { #scale data to Phred scale if necessary + ($vals,$err) = &convertQualAsciiToNumsPhred64($qual); + if($err) { + &printError("The sequence quality scores are not in Phred+64 format"); + } + } else { + $vals = &convertQualAsciiToNums($qual); + } + + #mean quality score + $graphdata->{qualsmean}->{int(&getArrayMean(@$vals))}++; + + my ($factor,$tmp,$count,$xmax,$bin,$tmpbin,$step); + + if($scale == 1) { #relative + #qual + if($length == 100) { + foreach my $i (0..99) { + $graphdata->{quals}->{$i}->{$vals->[$i]}++; + } + } elsif($length < 100) { #stretch + $factor = 100/$length; + foreach my $i (0..$length-1) { + $tmp = $vals->[$i]; + foreach my $j (int($i*$factor)..int(($i+1)*$factor)-1) { + $graphdata->{quals}->{$j}->{$tmp}++; + } + } + } elsif($length > 100) { #shrink + $factor = $length/100; + foreach my $i (0..99) { + $tmp = $count = 0; + foreach my $j (int($i*$factor)..int(($i+1)*$factor)-1) { + $tmp += $vals->[$j]; + $count++; + } + $graphdata->{quals}->{$i}->{int($tmp/$count)}++; + } + #my $piece = int($length/100); + #my $bypiece 
= 1/($piece+1); + #my $start = 0; + #my $end = 0; + #my $rest = ($length % 100) - 1; + #foreach my $i (0..$rest) { + # $end += $piece; + # $tmp = 0; + # foreach my $j ($start..$end) { + # $tmp += $vals->[$j]; + # } + # $graphdata{quals}->{$i}->{int($tmp * $bypiece)}++; + # $start = ++$end; + #} + #$rest++; + #$piece--; + #$bypiece = 1/($piece+1); + #foreach my $i ($rest..99) { + # $end += $piece; + # $tmp = 0; + # foreach my $j ($start..$end) { + # $tmp += $vals->[$j]; + # } + # $graphdata{quals}->{$i}->{int($tmp * $bypiece)}++; + # $start = ++$end; + #} + } + } + #absolute + foreach my $i (0..$length-1) { + $graphdata->{quala}->{$i}->{$vals->[$i]}++; + } +} + +sub getBinVal { + my $val = shift; + my $step; + if(!$val || $val <= 100) { + return 1; + } elsif($val < 10000) { + return int($val/100)+($val % 100 ? 1 : 0); + } elsif($val < 100000) { + return 1000; + } else { + $step = 1000000; + my $xmax = ($val % $step ? sprintf("%d",($val/$step+1))*$step : $val); + return ($xmax/100); + } +} + +sub convertStringToInt { + my $string = shift; + $string =~ s/(.)/sprintf("%x",ord($1))/eg; + return $string; +} + +sub getFileName { + my $str = shift; + $str =~ s/^.*\/([^\/]+)$/$1/; + return $str; +} diff -r 000000000000 -r 9790cfb46d03 prinseq.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/prinseq.xml Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,270 @@ + + (prinseq) + prinseq-lite.pl --version + + prinseq_perl_dependencies + PRINSEQ_SCRIPT_PATH + + + #import os + temp_graph_file = `mktemp`; + + perl \$PRINSEQ_SCRIPT_PATH/prinseq-lite.pl + #if $seq_type.seq_type_opt == 'single': + -fastq $seq_type.input_singles + #if $seq_type.input_singles.ext == 'fastqillumina': + -phred64 + #end if + #else: + -fastq $seq_type.input_mate1 + -fastq2 $seq_type.input_mate2 + #if $seq_type.input_mate1.ext != $seq_type.input_mate2.ext: + #import sys + #silent sys.stderr.write( 'Both pairs from your paired-end library need to be from the same filetype.' 
) + #end if + #if $seq_type.input_mate1.ext == 'fastqillumina': + -phred64 + #end if + #end if + + -out_good 'trimmed_reads' + ## we do not use the filter options in prinseq, so we are not interested in reads + ## that do not pass the filters + -out_bad null + + ## Trim options + #if $trim_to_len: + -trim_to_len $trim_to_len + #end if + + #if $trim_left: + -trim_left $trim_left + #end if + + #if $trim_right: + -trim_right $trim_right + #end if + + #if $trim_qual_left or $trim_qual_right: + -trim_qual_type $trim_qual_type + -trim_qual_rule $trim_qual_rule + -trim_qual_window $trim_qual_window + -trim_qual_step $trim_qual_step + #end if + + #if $trim_qual_left: + -trim_qual_left $trim_qual_left + #end if + + #if $trim_qual_right: + -trim_qual_right $trim_qual_right + #end if + + + -graph_stats #echo ','.join( $graph_stats )# + + ## summary statistics are written to stdout + -stats_all + + + -graph_data $temp_graph_file + + ; + + perl \$PRINSEQ_SCRIPT_PATH/prinseq-graphs-noPCA.pl -i $temp_graph_file -html_all -o #echo os.path.join( $html_file.files_path, 'graphs' )# + + ; + + python \$PRINSEQ_SCRIPT_PATH/create_index.py $html_file.files_path > $html_file + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + seq_type['seq_type_opt'] == "single" + + + + seq_type['seq_type_opt'] == "paired" + + + + + + + + + + + + + + + seq_type['seq_type_opt'] == "paired" + + + + + + + + + + + + + + + + + + + + + + + + + + + +.. class:: warningmark + +**TIP** + +----- + +**What it does** + + +PRINSEQ is a tool that generates summary statistics of sequence and quality data and that is used to filter, reformat and trim next-generation sequence data. 
+ + +http://prinseq.sourceforge.net/manual.html + + + ***** ORDER OF PROCESSING ***** + The available options are processed in the following order: + + seq_num, trim_left, trim_right, trim_left_p, trim_right_p, + trim_qual_left, trim_qual_right, trim_tail_left, + trim_tail_right, trim_ns_left, trim_ns_right, trim_to_len, + min_len, max_len, range_len, min_qual_score, max_qual_score, + min_qual_mean, max_qual_mean, min_gc, max_gc, range_gc, + ns_max_p, ns_max_n, noniupac, lc_method, derep, seq_id, + seq_case, dna_rna, out_format + + + + + + diff -r 000000000000 -r 9790cfb46d03 readme.rst --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/readme.rst Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,45 @@ +Galaxy wrapper for PRINSEQ +========================== + +This wrapper is copyright 2013 by Björn Grüning. + + + + +============ +Installation +============ + + + +======= +History +======= + + + - v0.1: no release yet + + + + +Wrapper Licence (MIT/BSD style) +=============================== + +Permission to use, copy, modify, and distribute this software and its +documentation with or without modifications and for any purpose and +without fee is hereby granted, provided that any copyright notices +appear in all copies and that both those copyright notices and this +permission notice appear in supporting documentation, and that the +names of the contributors or copyright holders not be used in +advertising or publicity pertaining to distribution of the software +without specific prior permission. 
+ +THE CONTRIBUTORS AND COPYRIGHT HOLDERS OF THIS SOFTWARE DISCLAIM ALL +WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING ALL IMPLIED +WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT SHALL THE +CONTRIBUTORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY SPECIAL, INDIRECT +OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES WHATSOEVER RESULTING FROM LOSS +OF USE, DATA OR PROFITS, WHETHER IN AN ACTION OF CONTRACT, NEGLIGENCE +OR OTHER TORTIOUS ACTION, ARISING OUT OF OR IN CONNECTION WITH THE USE +OR PERFORMANCE OF THIS SOFTWARE. + diff -r 000000000000 -r 9790cfb46d03 tool-data/example1.fasta --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool-data/example1.fasta Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,24 @@ +>seq1 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT +>seq1_dupl1 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT +>seq1_dupl2 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT +>seq1_dupl3 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT +>seq2 length=200 +ACGTGACGTGACGTGGTGTACACAGAGATATATGAGACACACAGATAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTACACAGAGATATATGAGACACACAGATAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTAACCACGT +>seq3 length=100 +TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTATAGATTGCGTGCGTACGTGTGTGCATGCG +>seq3_dupl1 length=100 +TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTATAGATTGCGTGCGTACGTGTGTGCATGCG +>seq4 length=50 +TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTAGAGA +>seq5 length=100 Ns_begin=10 +NNNNNNNNNNTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGAGAAGAGGCGTGGAGGAGATGACACACCCCGTGTGTTCTC +>seq6 length=100 Ns_end=10 
+TACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAANNNNNNNNNN +>seq7 length=50 +TACACCAGAGGTGTCTCTGTGTGGGTACACCAGAGGTGTCTCTGTGTGGG +>seq8 length=50 As=50 +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA \ No newline at end of file diff -r 000000000000 -r 9790cfb46d03 tool-data/example1.fastq --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool-data/example1.fastq Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,48 @@ +@seq1 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT ++seq1 length=100 +1234123412342134123412341234123412341234123412341234132412341234123412341234123412341234123412341234 +@seq1_dupl1 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT ++seq1_dupl1 length=100 +1234123412342134123412341234123412341234123412341234132412341234123412341234123412341234123412341234 +@seq1_dupl2 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT ++seq1_dupl2 length=100 +1234123412342134123412341234123412341234123412341234132412341234123412341234123412341234123412341234 +@seq1_dupl3 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT ++seq1_dupl3 length=100 +1234123412342134123412341234123412341234123412341234132412341234123412341234123412341234123412341234 +@seq2 length=200 +ACGTGACGTGACGTGGTGTACACAGAGATATATGAGACACACAGATAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTACACAGAGATATATGAGACACACAGATAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTAACCACGT ++seq2 length=200 +++++++++++++++++..............00000000000000000000333333333333333355555555555555555AAAAAAAAAAAAAA999999999999996666666666666666444444444444444442222222222222222222221111111)))))))))))))))>>>>>>>>>>>>> +@seq3 length=100 
+TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTATAGATTGCGTGCGTACGTGTGTGCATGCG ++seq3 length=100 +FHHFFFFFFDDAA@====AAB===BBBBAAADDDDDDDDAAAADDDDD?????FFFFF??FFFFFFDA@AFFFFFFFFFFFFFFFFCCABA?>9:773.. +@seq3_dupl1 length=100 +TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTATAGATTGCGTGCGTACGTGTGTGCATGCG ++seq3_dupl1 length=100 +FHHFFFFFFDDAA@====AAB===BBBBAAADDDDDDDDAAAADDDDD?????FFFFF??FFFFFFDA@AFFFFFFFFFFFFFFFFCCABA?>9:773.. +@seq4 length=50 +TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTAGAGA ++seq4 length=50 +???CCDDBBBBBAA333:ABB=:::AGFFFFFHHHHFFFFFFFF??@FFF +@seq5 length=100 Ns_begin=10 +NNNNNNNNNNTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGAGAAGAGGCGTGGAGGAGATGACACACCCCGTGTGTTCTC ++seq5 length=100 Ns_begin=10 +!!!!!!!!!!99999999999999AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEFFFFFFFFFFFFFFFF +@seq6 length=100 Ns_end=10 +TACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAANNNNNNNNNN ++seq6 length=100 Ns_end=10 +FFFFFFFFFFFFFFFFFFEEEEEEEEEEEEEEEEEEEAAAAAAAAAAAAAAAAAAAAAAAAA9999999999999999999995555555!!!!!!!!!! 
+@seq7 length=50 +TACACCAGAGGTGTCTCTGTGTGGGTACACCAGAGGTGTCTCTGTGTGGG ++seq7 length=50 +!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>> +@seq8 length=50 As=50 +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ++seq8 length=50 As=50 +55555555555555555555555555555555555555555555555555 \ No newline at end of file diff -r 000000000000 -r 9790cfb46d03 tool-data/example1.gd --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool-data/example1.gd Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,3 @@ +#Graph data +#[prinseq-lite-0.20.3] [03/13/2013 12:11:59] Command: "perl prinseq-lite.pl -fastq example/example1.fastq -graph_data example/example1.gd -verbose -out_good null -out_bad null" +{"numseqs":12,"numbases":1150,"pairedend":0,"maxlength":200,"binval":2,"exactonly":0,"tagmidnum":0,"scale":1,"filename1":"6578616d706c65312e6661737471","format1":"fastq","counts":{"gc":{"50":5,"53":1,"0":1,"49":1,"56":2,"57":2},"length":{"50":3,"200":1,"100":8},"ns":{"10":2},"tail3":{"50":1},"tail5":{"50":1}},"stats":{"gc":{"p25":50,"p75":56,"mode":50,"min":0,"std":"14.83","range":58,"max":57,"modeval":5,"median":50,"mean":"48.17"},"length":{"p25":75,"p75":100,"mode":100,"min":50,"std":"37.96","range":151,"max":200,"modeval":8,"median":100,"mean":"95.83"},"ns":{"p25":10,"p75":10,"mode":10,"min":10,"std":"0.00","range":1,"max":10,"modeval":2,"median":10,"mean":"10.00"},"tail3":{"p25":50,"p75":50,"mode":50,"min":50,"std":"0.00","range":1,"max":50,"modeval":1,"median":50,"mean":"50.00"},"tail5":{"p25":50,"p75":50,"mode":50,"min":50,"std":"0.00","range":1,"max":50,"modeval":1,"median":50,"mean":"50.00"}},"quals":{"0":{"p25":13,"p75":33,"mode":16,"min":0,"std":"12.68","range":38,"max":37,"modeval":4,"median":16,"mean":"19.58"},"1":{"p25":13,"p75":33,"mode":17,"min":0,"std":"13.06","range":40,"max":39,"modeval":4,"median":17,"mean":"20.25"},"10":{"p25":18,"p75":35,"mode":18,"min":7,"std":"9.55","range":31,"max":37,"modeval":4,"median":19,"mean":"23.17"},"11":{"p25":19,"p75":32,"mode":19,"min":7,"st
d":"8.81","range":31,"max":37,"modeval":4,"median":19,"mean":"23.00"},"12":{"p25":17,"p75":32,"mode":17,"min":7,"std":"9.16","range":31,"max":37,"modeval":4,"median":18,"mean":"22.33"},"13":{"p25":16,"p75":31,"mode":16,"min":7,"std":"9.19","range":31,"max":37,"modeval":4,"median":18,"mean":"21.83"},"14":{"p25":18,"p75":28,"mode":18,"min":7,"std":"8.14","range":31,"max":37,"modeval":4,"median":19,"mean":"21.83"},"15":{"p25":19,"p75":28,"mode":19,"min":7,"std":"7.82","range":31,"max":37,"modeval":4,"median":19,"mean":"22.33"},"16":{"p25":16,"p75":28,"mode":16,"min":9,"std":"8.09","range":29,"max":37,"modeval":4,"median":18,"mean":"21.50"},"17":{"p25":17,"p75":28,"mode":17,"min":9,"std":"7.87","range":29,"max":37,"modeval":4,"median":18,"mean":"21.83"},"18":{"p25":18,"p75":32,"mode":18,"min":9,"std":"8.17","range":28,"max":36,"modeval":4,"median":19,"mean":"22.75"},"19":{"p25":19,"p75":32,"mode":19,"min":9,"std":"7.98","range":28,"max":36,"modeval":4,"median":19,"mean":"23.08"},"2":{"p25":14,"p75":33,"mode":18,"min":0,"std":"12.28","range":40,"max":39,"modeval":4,"median":18,"mean":"21.08"},"20":{"p25":16,"p75":33,"mode":16,"min":9,"std":"8.80","range":28,"max":36,"modeval":4,"median":18,"mean":"22.25"},"21":{"p25":17,"p75":28,"mode":17,"min":9,"std":"7.71","range":28,"max":36,"modeval":4,"median":18,"mean":"21.75"},"22":{"p25":18,"p75":28,"mode":18,"min":10,"std":"7.38","range":27,"max":36,"modeval":4,"median":19,"mean":"22.17"},"23":{"p25":19,"p75":28,"mode":19,"min":10,"std":"7.21","range":27,"max":36,"modeval":4,"median":19,"mean":"22.50"},"24":{"p25":16,"p75":32,"mode":16,"min":8,"std":"9.24","range":29,"max":36,"modeval":4,"median":18,"mean":"22.75"},"25":{"p25":17,"p75":32,"mode":17,"min":8,"std":"8.82","range":29,"max":36,"modeval":4,"median":19,"mean":"23.33"},"26":{"p25":18,"p75":32,"mode":18,"min":8,"std":"8.59","range":29,"max":36,"modeval":5,"median":19,"mean":"23.67"},"27":{"p25":19,"p75":32,"mode":19,"min":8,"std":"8.38","range":29,"max":36,"modeval":4,"
median":19,"mean":"24.00"},"28":{"p25":16,"p75":32,"mode":16,"min":4,"std":"9.10","range":33,"max":36,"modeval":4,"median":18,"mean":"21.33"},"29":{"p25":17,"p75":32,"mode":17,"min":4,"std":"8.92","range":33,"max":36,"modeval":4,"median":18,"mean":"21.67"},"3":{"p25":14,"p75":33,"mode":19,"min":0,"std":"11.74","range":38,"max":37,"modeval":4,"median":19,"mean":"21.08"},"30":{"p25":18,"p75":32,"mode":18,"min":4,"std":"8.76","range":33,"max":36,"modeval":6,"median":18,"mean":"22.00"},"31":{"p25":18,"p75":33,"mode":19,"min":4,"std":"9.23","range":33,"max":36,"modeval":4,"median":19,"mean":"22.83"},"32":{"p25":16,"p75":33,"mode":16,"min":4,"std":"9.74","range":33,"max":36,"modeval":4,"median":18,"mean":"21.83"},"33":{"p25":17,"p75":33,"mode":17,"min":4,"std":"9.49","range":33,"max":36,"modeval":4,"median":19,"mean":"22.33"},"34":{"p25":18,"p75":33,"mode":18,"min":10,"std":"8.28","range":27,"max":36,"modeval":4,"median":20,"mean":"23.75"},"35":{"p25":19,"p75":33,"mode":19,"min":10,"std":"8.06","range":27,"max":36,"modeval":4,"median":20,"mean":"24.08"},"36":{"p25":16,"p75":33,"mode":16,"min":10,"std":"9.11","range":27,"max":36,"modeval":4,"median":20,"mean":"23.67"},"37":{"p25":17,"p75":32,"mode":17,"min":10,"std":"8.45","range":26,"max":35,"modeval":4,"median":20,"mean":"23.67"},"38":{"p25":18,"p75":32,"mode":18,"min":8,"std":"8.58","range":28,"max":35,"modeval":4,"median":20,"mean":"23.92"},"39":{"p25":19,"p75":32,"mode":19,"min":8,"std":"7.77","range":26,"max":33,"modeval":4,"median":20,"mean":"23.75"},"4":{"p25":13,"p75":33,"mode":16,"min":0,"std":"12.00","range":38,"max":37,"modeval":4,"median":16,"mean":"20.08"},"40":{"p25":16,"p75":32,"mode":16,"min":7,"std":"8.62","range":27,"max":33,"modeval":4,"median":20,"mean":"22.67"},"41":{"p25":17,"p75":32,"mode":17,"min":7,"std":"8.36","range":27,"max":33,"modeval":4,"median":23,"mean":"23.50"},"42":{"p25":18,"p75":32,"mode":32,"min":4,"std":"8.63","range":29,"max":32,"modeval":5,"median":24,"mean":"23.67"},"43":{"p25":19
,"p75":32,"mode":19,"min":4,"std":"8.96","range":32,"max":35,"modeval":4,"median":24,"mean":"24.50"},"44":{"p25":16,"p75":32,"mode":16,"min":4,"std":"9.58","range":32,"max":35,"modeval":4,"median":22,"mean":"23.25"},"45":{"p25":17,"p75":32,"mode":17,"min":4,"std":"9.33","range":32,"max":35,"modeval":4,"median":22,"mean":"23.58"},"46":{"p25":18,"p75":32,"mode":18,"min":4,"std":"9.11","range":32,"max":35,"modeval":4,"median":22,"mean":"23.92"},"47":{"p25":19,"p75":32,"mode":19,"min":4,"std":"8.90","range":32,"max":35,"modeval":4,"median":22,"mean":"24.25"},"48":{"p25":16,"p75":30,"mode":16,"min":4,"std":"8.39","range":29,"max":32,"modeval":4,"median":22,"mean":"22.08"},"49":{"p25":17,"p75":30,"mode":17,"min":4,"std":"8.00","range":29,"max":32,"modeval":4,"median":22,"mean":"22.08"},"5":{"p25":13,"p75":33,"mode":17,"min":0,"std":"11.89","range":38,"max":37,"modeval":4,"median":17,"mean":"20.42"},"50":{"p25":18,"p75":31,"mode":18,"min":8,"std":"7.50","range":25,"max":32,"modeval":4,"median":22,"mean":"23.33"},"51":{"p25":19,"p75":31,"mode":19,"min":8,"std":"7.27","range":25,"max":32,"modeval":4,"median":22,"mean":"23.67"},"52":{"p25":16,"p75":31,"mode":16,"min":13,"std":"8.10","range":26,"max":38,"modeval":4,"median":22,"mean":"23.58"},"53":{"p25":18,"p75":34,"mode":18,"min":13,"std":"8.75","range":26,"max":38,"modeval":4,"median":22,"mean":"25.42"},"54":{"p25":17,"p75":34,"mode":17,"min":16,"std":"8.62","range":22,"max":37,"modeval":4,"median":22,"mean":"25.25"},"55":{"p25":19,"p75":36,"mode":19,"min":16,"std":"8.40","range":22,"max":37,"modeval":4,"median":21,"mean":"26.08"},"56":{"p25":16,"p75":36,"mode":16,"min":9,"std":"10.08","range":29,"max":37,"modeval":4,"median":20,"mean":"24.42"},"57":{"p25":17,"p75":36,"mode":17,"min":9,"std":"9.81","range":29,"max":37,"modeval":4,"median":20,"mean":"24.75"},"58":{"p25":18,"p75":31,"mode":18,"min":9,"std":"8.38","range":29,"max":37,"modeval":4,"median":20,"mean":"23.92"},"59":{"p25":19,"p75":31,"mode":19,"min":9,"std":"8.16"
,"range":29,"max":37,"modeval":4,"median":20,"mean":"24.25"},"6":{"p25":14,"p75":35,"mode":18,"min":0,"std":"11.83","range":38,"max":37,"modeval":4,"median":18,"mean":"21.33"},"60":{"p25":16,"p75":36,"mode":16,"min":9,"std":"10.08","range":29,"max":37,"modeval":4,"median":20,"mean":"24.42"},"61":{"p25":17,"p75":36,"mode":17,"min":9,"std":"9.81","range":29,"max":37,"modeval":4,"median":20,"mean":"24.75"},"62":{"p25":18,"p75":36,"mode":18,"min":12,"std":"8.94","range":26,"max":37,"modeval":4,"median":20,"mean":"24.67"},"63":{"p25":19,"p75":36,"mode":19,"min":12,"std":"8.74","range":26,"max":37,"modeval":4,"median":20,"mean":"24.92"},"64":{"p25":16,"p75":36,"mode":16,"min":10,"std":"10.00","range":30,"max":39,"modeval":4,"median":19,"mean":"23.83"},"65":{"p25":17,"p75":36,"mode":17,"min":10,"std":"9.75","range":30,"max":39,"modeval":4,"median":19,"mean":"24.17"},"66":{"p25":18,"p75":35,"mode":18,"min":9,"std":"9.22","range":31,"max":39,"modeval":4,"median":19,"mean":"24.08"},"67":{"p25":19,"p75":32,"mode":19,"min":9,"std":"8.48","range":31,"max":39,"modeval":5,"median":19,"mean":"23.92"},"68":{"p25":16,"p75":31,"mode":16,"min":6,"std":"9.40","range":34,"max":39,"modeval":4,"median":19,"mean":"22.50"},"69":{"p25":17,"p75":32,"mode":17,"min":6,"std":"9.34","range":34,"max":39,"modeval":4,"median":19,"mean":"23.00"},"7":{"p25":14,"p75":35,"mode":19,"min":0,"std":"11.74","range":38,"max":37,"modeval":4,"median":19,"mean":"21.67"},"70":{"p25":18,"p75":36,"mode":18,"min":6,"std":"10.07","range":34,"max":39,"modeval":4,"median":19,"mean":"24.17"},"71":{"p25":19,"p75":36,"mode":19,"min":6,"std":"9.87","range":34,"max":39,"modeval":5,"median":19,"mean":"24.50"},"72":{"p25":16,"p75":36,"mode":16,"min":8,"std":"10.09","range":30,"max":37,"modeval":4,"median":18,"mean":"23.33"},"73":{"p25":17,"p75":36,"mode":17,"min":8,"std":"9.86","range":30,"max":37,"modeval":5,"median":18,"mean":"23.67"},"74":{"p25":18,"p75":36,"mode":18,"min":8,"std":"9.64","range":30,"max":37,"modeval":4,"med
ian":19,"mean":"24.00"},"75":{"p25":19,"p75":36,"mode":19,"min":8,"std":"9.45","range":30,"max":37,"modeval":4,"median":19,"mean":"24.33"},"76":{"p25":16,"p75":36,"mode":16,"min":9,"std":"9.97","range":29,"max":37,"modeval":4,"median":18,"mean":"23.42"},"77":{"p25":17,"p75":36,"mode":17,"min":9,"std":"9.73","range":29,"max":37,"modeval":5,"median":18,"mean":"23.75"},"78":{"p25":18,"p75":36,"mode":18,"min":9,"std":"9.51","range":29,"max":37,"modeval":4,"median":19,"mean":"24.08"},"79":{"p25":19,"p75":36,"mode":19,"min":9,"std":"9.30","range":29,"max":37,"modeval":4,"median":19,"mean":"24.42"},"8":{"p25":14,"p75":35,"mode":16,"min":0,"std":"12.04","range":38,"max":37,"modeval":4,"median":16,"mean":"20.75"},"80":{"p25":16,"p75":36,"mode":16,"min":16,"std":"9.07","range":22,"max":37,"modeval":4,"median":20,"mean":"24.33"},"81":{"p25":17,"p75":36,"mode":17,"min":17,"std":"8.77","range":21,"max":37,"modeval":5,"median":20,"mean":"24.67"},"82":{"p25":18,"p75":36,"mode":18,"min":16,"std":"8.57","range":22,"max":37,"modeval":4,"median":20,"mean":"24.92"},"83":{"p25":19,"p75":36,"mode":19,"min":16,"std":"8.43","range":22,"max":37,"modeval":4,"median":20,"mean":"24.92"},"84":{"p25":16,"p75":37,"mode":16,"min":16,"std":"9.62","range":22,"max":37,"modeval":5,"median":20,"mean":"25.17"},"85":{"p25":17,"p75":37,"mode":17,"min":16,"std":"9.31","range":22,"max":37,"modeval":4,"median":20,"mean":"25.50"},"86":{"p25":18,"p75":34,"mode":18,"min":8,"std":"9.42","range":30,"max":37,"modeval":4,"median":20,"mean":"24.67"},"87":{"p25":19,"p75":34,"mode":19,"min":8,"std":"9.19","range":30,"max":37,"modeval":4,"median":20,"mean":"25.00"},"88":{"p25":16,"p75":32,"mode":16,"min":8,"std":"9.31","range":30,"max":37,"modeval":4,"median":20,"mean":"23.33"},"89":{"p25":17,"p75":33,"mode":17,"min":8,"std":"9.22","range":30,"max":37,"modeval":4,"median":20,"mean":"23.83"},"9":{"p25":15,"p75":34,"mode":17,"min":0,"std":"11.48","range":38,"max":37,"modeval":4,"median":17,"mean":"20.75"},"90":{"p25":18,
"p75":31,"mode":18,"min":0,"std":"10.35","range":38,"max":37,"modeval":4,"median":19,"mean":"21.67"},"91":{"p25":19,"p75":30,"mode":19,"min":0,"std":"9.94","range":38,"max":37,"modeval":4,"median":19,"mean":"21.67"},"92":{"p25":16,"p75":29,"mode":16,"min":0,"std":"10.23","range":38,"max":37,"modeval":4,"median":18,"mean":"20.58"},"93":{"p25":17,"p75":26,"mode":17,"min":0,"std":"8.91","range":38,"max":37,"modeval":4,"median":19,"mean":"20.92"},"94":{"p25":18,"p75":29,"mode":18,"min":0,"std":"9.62","range":38,"max":37,"modeval":4,"median":22,"mean":"22.83"},"95":{"p25":19,"p75":29,"mode":19,"min":0,"std":"9.43","range":38,"max":37,"modeval":4,"median":21,"mean":"22.67"},"96":{"p25":16,"p75":29,"mode":16,"min":0,"std":"9.91","range":38,"max":37,"modeval":4,"median":21,"mean":"21.67"},"97":{"p25":17,"p75":29,"mode":17,"min":0,"std":"9.84","range":38,"max":37,"modeval":4,"median":18,"mean":"21.33"},"98":{"p25":15,"p75":29,"mode":18,"min":0,"std":"10.19","range":38,"max":37,"modeval":4,"median":18,"mean":"20.83"},"99":{"p25":16,"p75":29,"mode":19,"min":0,"std":"10.11","range":38,"max":37,"modeval":4,"median":19,"mean":"21.17"}},"qualsbin":{"0":{"p25":13,"p75":33,"mode":37,"min":0,"std":"12.46","range":40,"max":39,"modeval":5,"median":18,"mean":"20.47"},"1":{"p25":13,"p75":35,"mode":37,"min":0,"std":"12.00","range":38,"max":37,"modeval":6,"median":19,"mean":"21.08"},"10":{"p25":17,"p75":28,"mode":28,"min":4,"std":"8.00","range":33,"max":36,"modeval":5,"median":19,"mean":"20.79"},"11":{"p25":16,"p75":28,"mode":16,"min":4,"std":"8.70","range":33,"max":36,"modeval":4,"median":19,"mean":"21.42"},"12":{"p25":17,"p75":33,"mode":17,"min":8,"std":"9.10","range":31,"max":38,"modeval":4,"median":19,"mean":"23.54"},"13":{"p25":16,"p75":32,"mode":16,"min":9,"std":"9.04","range":29,"max":37,"modeval":5,"median":19,"mean":"23.79"},"14":{"p25":17,"p75":32,"mode":32,"min":9,"std":"9.19","range":29,"max":37,"modeval":6,"median":19,"mean":"23.50"},"15":{"p25":16,"p75":35,"mode":16,"min":10,
"std":"9.53","range":30,"max":39,"modeval":4,"median":19,"mean":"24.33"},"16":{"p25":17,"p75":35,"mode":17,"min":6,"std":"10.07","range":34,"max":39,"modeval":4,"median":19,"mean":"24.12"},"17":{"p25":16,"p75":35,"mode":16,"min":6,"std":"10.05","range":34,"max":39,"modeval":4,"median":19,"mean":"24.00"},"18":{"p25":17,"p75":33,"mode":17,"min":8,"std":"9.33","range":30,"max":37,"modeval":4,"median":19,"mean":"23.71"},"19":{"p25":16,"p75":32,"mode":32,"min":9,"std":"8.26","range":29,"max":37,"modeval":8,"median":20,"mean":"23.71"},"2":{"p25":13,"p75":36,"mode":37,"min":0,"std":"12.15","range":38,"max":37,"modeval":6,"median":18,"mean":"21.08"},"20":{"p25":17,"p75":32,"mode":32,"min":15,"std":"7.87","range":23,"max":37,"modeval":8,"median":20,"mean":"24.75"},"21":{"p25":17,"p75":34,"mode":16,"min":15,"std":"8.41","range":23,"max":37,"modeval":4,"median":25,"mean":"25.67"},"22":{"p25":17,"p75":32,"mode":17,"min":15,"std":"7.61","range":21,"max":35,"modeval":4,"median":24,"mean":"24.88"},"23":{"p25":17,"p75":32,"mode":16,"min":15,"std":"7.86","range":23,"max":37,"modeval":4,"median":24,"mean":"25.00"},"24":{"p25":18,"p75":30,"mode":18,"min":15,"std":"7.13","range":23,"max":37,"modeval":5,"median":20,"mean":"24.14"},"25":{"p25":18,"p75":30,"mode":16,"min":16,"std":"6.77","range":17,"max":32,"modeval":4,"median":19,"mean":"23.56"},"26":{"p25":18,"p75":32,"mode":18,"min":17,"std":"8.57","range":21,"max":37,"modeval":6,"median":18,"mean":"25.11"},"27":{"p25":18,"p75":36,"mode":16,"min":16,"std":"9.06","range":22,"max":37,"modeval":4,"median":19,"mean":"25.56"},"28":{"p25":18,"p75":32,"mode":18,"min":17,"std":"8.26","range":21,"max":37,"modeval":6,"median":18,"mean":"24.78"},"29":{"p25":18,"p75":32,"mode":16,"min":16,"std":"8.31","range":22,"max":37,"modeval":4,"median":19,"mean":"24.78"},"3":{"p25":13,"p75":35,"mode":37,"min":0,"std":"11.91","range":38,"max":37,"modeval":6,"median":19,"mean":"21.00"},"30":{"p25":18,"p75":36,"mode":18,"min":17,"std":"8.87","range":21,"max":37
,"modeval":6,"median":18,"mean":"25.11"},"31":{"p25":18,"p75":36,"mode":16,"min":16,"std":"8.77","range":22,"max":37,"modeval":4,"median":19,"mean":"24.67"},"32":{"p25":18,"p75":35,"mode":18,"min":17,"std":"8.35","range":21,"max":37,"modeval":5,"median":19,"mean":"24.56"},"33":{"p25":19,"p75":31,"mode":16,"min":16,"std":"7.07","range":21,"max":36,"modeval":4,"median":20,"mean":"23.67"},"34":{"p25":18,"p75":32,"mode":17,"min":17,"std":"7.90","range":21,"max":37,"modeval":4,"median":20,"mean":"24.33"},"35":{"p25":19,"p75":36,"mode":16,"min":16,"std":"8.62","range":22,"max":37,"modeval":4,"median":20,"mean":"24.89"},"36":{"p25":18,"p75":36,"mode":17,"min":17,"std":"8.57","range":21,"max":37,"modeval":4,"median":20,"mean":"24.89"},"37":{"p25":19,"p75":36,"mode":16,"min":16,"std":"8.62","range":22,"max":37,"modeval":4,"median":20,"mean":"24.89"},"38":{"p25":18,"p75":36,"mode":17,"min":17,"std":"8.57","range":21,"max":37,"modeval":4,"median":20,"mean":"24.89"},"39":{"p25":19,"p75":36,"mode":16,"min":16,"std":"8.62","range":22,"max":37,"modeval":4,"median":20,"mean":"24.89"},"4":{"p25":17,"p75":34,"mode":17,"min":0,"std":"10.51","range":38,"max":37,"modeval":4,"median":18,"mean":"21.75"},"40":{"p25":18,"p75":36,"mode":17,"min":17,"std":"8.57","range":21,"max":37,"modeval":4,"median":20,"mean":"24.89"},"41":{"p25":19,"p75":37,"mode":37,"min":16,"std":"8.96","range":22,"max":37,"modeval":5,"median":20,"mean":"25.83"},"42":{"p25":18,"p75":34,"mode":17,"min":17,"std":"8.60","range":21,"max":37,"modeval":4,"median":20,"mean":"25.56"},"43":{"p25":19,"p75":32,"mode":16,"min":16,"std":"8.04","range":22,"max":37,"modeval":4,"median":20,"mean":"25.00"},"44":{"p25":17,"p75":32,"mode":17,"min":0,"std":"9.68","range":38,"max":37,"modeval":4,"median":19,"mean":"23.78"},"45":{"p25":16,"p75":30,"mode":16,"min":0,"std":"10.60","range":38,"max":37,"modeval":4,"median":19,"mean":"22.00"},"46":{"p25":17,"p75":25,"mode":17,"min":0,"std":"9.96","range":38,"max":37,"modeval":4,"median":18,"mean"
:"20.89"},"47":{"p25":16,"p75":22,"mode":16,"min":0,"std":"9.85","range":38,"max":37,"modeval":4,"median":19,"mean":"20.33"},"48":{"p25":17,"p75":18,"mode":18,"min":0,"std":"9.12","range":38,"max":37,"modeval":6,"median":18,"mean":"18.00"},"49":{"p25":13,"p75":24,"mode":19,"min":0,"std":"8.98","range":38,"max":37,"modeval":4,"median":19,"mean":"18.70"},"5":{"p25":17,"p75":32,"mode":32,"min":8,"std":"8.77","range":30,"max":37,"modeval":5,"median":19,"mean":"22.38"},"50":{"p25":24,"p75":24,"mode":24,"min":24,"std":"0.00","range":1,"max":24,"modeval":2,"median":24,"mean":"24.00"},"51":{"p25":24,"p75":24,"mode":24,"min":24,"std":"0.00","range":1,"max":24,"modeval":2,"median":24,"mean":"24.00"},"52":{"p25":24,"p75":24,"mode":24,"min":24,"std":"0.00","range":1,"max":24,"modeval":2,"median":24,"mean":"24.00"},"53":{"p25":24,"p75":24,"mode":24,"min":24,"std":"0.00","range":1,"max":24,"modeval":2,"median":24,"mean":"24.00"},"54":{"p25":24,"p75":24,"mode":24,"min":24,"std":"0.00","range":1,"max":24,"modeval":2,"median":24,"mean":"24.00"},"55":{"p25":21,"p75":21,"mode":21,"min":21,"std":"0.00","range":1,"max":21,"modeval":2,"median":21,"mean":"21.00"},"56":{"p25":21,"p75":21,"mode":21,"min":21,"std":"0.00","range":1,"max":21,"modeval":2,"median":21,"mean":"21.00"},"57":{"p25":21,"p75":21,"mode":21,"min":21,"std":"0.00","range":1,"max":21,"modeval":2,"median":21,"mean":"21.00"},"58":{"p25":21,"p75":21,"mode":21,"min":21,"std":"0.00","range":1,"max":21,"modeval":2,"median":21,"mean":"21.00"},"59":{"p25":21,"p75":21,"mode":21,"min":21,"std":"0.00","range":1,"max":21,"modeval":2,"median":21,"mean":"21.00"},"6":{"p25":16,"p75":28,"mode":18,"min":4,"std":"8.65","range":34,"max":37,"modeval":5,"median":18,"mean":"20.75"},"60":{"p25":21,"p75":21,"mode":21,"min":21,"std":"0.00","range":1,"max":21,"modeval":2,"median":21,"mean":"21.00"},"61":{"p25":21,"p75":21,"mode":21,"min":21,"std":"0.00","range":1,"max":21,"modeval":2,"median":21,"mean":"21.00"},"62":{"p25":21,"p75":21,"mode":21,"mi
n":21,"std":"0.00","range":1,"max":21,"modeval":2,"median":21,"mean":"21.00"},"63":{"p25":19,"p75":19,"mode":19,"min":19,"std":"0.00","range":1,"max":19,"modeval":2,"median":19,"mean":"19.00"},"64":{"p25":19,"p75":19,"mode":19,"min":19,"std":"0.00","range":1,"max":19,"modeval":2,"median":19,"mean":"19.00"},"65":{"p25":19,"p75":19,"mode":19,"min":19,"std":"0.00","range":1,"max":19,"modeval":2,"median":19,"mean":"19.00"},"66":{"p25":19,"p75":19,"mode":19,"min":19,"std":"0.00","range":1,"max":19,"modeval":2,"median":19,"mean":"19.00"},"67":{"p25":19,"p75":19,"mode":19,"min":19,"std":"0.00","range":1,"max":19,"modeval":2,"median":19,"mean":"19.00"},"68":{"p25":19,"p75":19,"mode":19,"min":19,"std":"0.00","range":1,"max":19,"modeval":2,"median":19,"mean":"19.00"},"69":{"p25":19,"p75":19,"mode":19,"min":19,"std":"0.00","range":1,"max":19,"modeval":2,"median":19,"mean":"19.00"},"7":{"p25":16,"p75":26,"mode":16,"min":4,"std":"8.17","range":34,"max":37,"modeval":4,"median":19,"mean":"20.04"},"70":{"p25":19,"p75":19,"mode":19,"min":19,"std":"0.00","range":1,"max":19,"modeval":2,"median":19,"mean":"19.00"},"71":{"p25":17,"p75":19,"mode":17,"min":17,"std":"1.00","range":3,"max":19,"modeval":1,"median":18,"mean":"18.00"},"72":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"73":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"74":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"75":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"76":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"77":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"78":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval
":2,"median":17,"mean":"17.00"},"79":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"8":{"p25":17,"p75":28,"mode":17,"min":10,"std":"7.72","range":28,"max":37,"modeval":4,"median":19,"mean":"21.83"},"80":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"81":{"p25":17,"p75":17,"mode":17,"min":17,"std":"0.00","range":1,"max":17,"modeval":2,"median":17,"mean":"17.00"},"82":{"p25":16,"p75":16,"mode":16,"min":16,"std":"0.00","range":1,"max":16,"modeval":2,"median":16,"mean":"16.00"},"83":{"p25":16,"p75":16,"mode":16,"min":16,"std":"0.00","range":1,"max":16,"modeval":2,"median":16,"mean":"16.00"},"84":{"p25":16,"p75":16,"mode":16,"min":16,"std":"0.00","range":1,"max":16,"modeval":2,"median":16,"mean":"16.00"},"85":{"p25":8,"p75":16,"mode":8,"min":8,"std":"4.00","range":9,"max":16,"modeval":1,"median":12,"mean":"12.00"},"86":{"p25":8,"p75":8,"mode":8,"min":8,"std":"0.00","range":1,"max":8,"modeval":2,"median":8,"mean":"8.00"},"87":{"p25":8,"p75":8,"mode":8,"min":8,"std":"0.00","range":1,"max":8,"modeval":2,"median":8,"mean":"8.00"},"88":{"p25":8,"p75":8,"mode":8,"min":8,"std":"0.00","range":1,"max":8,"modeval":2,"median":8,"mean":"8.00"},"89":{"p25":8,"p75":8,"mode":8,"min":8,"std":"0.00","range":1,"max":8,"modeval":2,"median":8,"mean":"8.00"},"9":{"p25":16,"p75":32,"mode":16,"min":7,"std":"8.78","range":30,"max":36,"modeval":4,"median":19,"mean":"22.38"},"90":{"p25":8,"p75":8,"mode":8,"min":8,"std":"0.00","range":1,"max":8,"modeval":2,"median":8,"mean":"8.00"},"91":{"p25":8,"p75":8,"mode":8,"min":8,"std":"0.00","range":1,"max":8,"modeval":2,"median":8,"mean":"8.00"},"92":{"p25":8,"p75":8,"mode":8,"min":8,"std":"0.00","range":1,"max":8,"modeval":2,"median":8,"mean":"8.00"},"93":{"p25":29,"p75":29,"mode":29,"min":29,"std":"0.00","range":1,"max":29,"modeval":2,"median":29,"mean":"29.00"},"94":{"p25":29,"p75":29,"mode":29,"min":29,"std":"0.00","range":1,
"max":29,"modeval":2,"median":29,"mean":"29.00"},"95":{"p25":29,"p75":29,"mode":29,"min":29,"std":"0.00","range":1,"max":29,"modeval":2,"median":29,"mean":"29.00"},"96":{"p25":29,"p75":29,"mode":29,"min":29,"std":"0.00","range":1,"max":29,"modeval":2,"median":29,"mean":"29.00"},"97":{"p25":29,"p75":29,"mode":29,"min":29,"std":"0.00","range":1,"max":29,"modeval":2,"median":29,"mean":"29.00"},"98":{"p25":29,"p75":29,"mode":29,"min":29,"std":"0.00","range":1,"max":29,"modeval":2,"median":29,"mean":"29.00"},"99":{"p25":29,"p75":29,"mode":29,"min":29,"std":"0.00","range":1,"max":29,"modeval":1,"median":29,"mean":"29.00"}},"complvals":{"dust":{"maxval":100,"minseq":"NNNNNNNNNNTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGAGAAGAGGCGTGGAGGAGATGACACACCCCGTGTGTTCTC","minval":2,"maxseq":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA"},"entropy":{"maxval":81,"minseq":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA","minval":0,"maxseq":"NNNNNNNNNNTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGAGAAGAGGCGTGGAGGAGATGACACACCCCGTGTGTTCTC"}},"dubscounts":{"1":{"0":1},"3":{"0":1}},"dubslength":{"100":{"0":4}},"qualsmean":{"27":1,"11":1,"32":1,"33":2,"18":1,"29":1,"17":4,"20":1},"tagprob":{"3":33,"5":33},"compldust":{"4":2,"3":2,"23":4,"100":1,"2":1,"5":2},"complentropy":{"81":1,"35":4,"77":1,"0":1,"73":3,"76":2},"dinucodds":{"AATT":"0.395437960","AT":"0.362329006","AGCT":"0.485808170","CATG":"1.073571725","GATC":"0.433830048","CCGG":"0.300572486","CG":"1.916620315","TA":"1.833186173","GC":"0.582918467","ACGT":"2.437785575"},"tail":1,"freqs":{"3":{"11":{"A":8,"T":50,"N":8,"C":16,"G":16},"7":{"A":16,"T":33,"N":0,"C":25,"G":25},"17":{"A":16,"T":0,"N":8,"C":50,"G":25},"2":{"A":8,"T":16,"N":0,"C":33,"G":41},"1":{"A":25,"T":8,"N":0,"C":33,"G":33},"18":{"A":8,"T":8,"N":8,"C":16,"G":58},"0":{"A":41,"T":33,"N":0,"C":16,"G":8},"16":{"A":50,"T":33,"C":0,"N":8,"G":8},"13":{"A":16,"T":8,"N":8,"C":33,"G":33},"6":{"A":16,"T":8,"N":0,"C":25,"G":50},"3":{"A":25,"T":41,"N":0,"C":0,"G":33
},"9":{"A":16,"T":8,"N":0,"C":58,"G":16},"12":{"A":50,"T":33,"C":0,"N":8,"G":8},"14":{"A":8,"T":16,"N":8,"C":25,"G":41},"15":{"A":33,"T":41,"N":8,"C":8,"G":8},"8":{"A":50,"T":25,"N":0,"C":8,"G":16},"4":{"A":41,"T":16,"N":0,"C":25,"G":16},"19":{"A":16,"T":41,"N":8,"C":8,"G":25},"10":{"A":8,"T":33,"C":0,"N":8,"G":50},"5":{"A":33,"T":8,"N":0,"C":50,"G":8}},"5":{"11":{"A":16,"T":50,"N":0,"C":33,"G":0},"7":{"A":8,"T":33,"N":8,"C":25,"G":25},"17":{"A":8,"T":41,"N":0,"C":33,"G":16},"2":{"A":8,"T":0,"C":16,"N":8,"G":66},"1":{"A":50,"T":0,"N":8,"C":41,"G":0},"18":{"A":16,"T":8,"N":0,"C":0,"G":75},"0":{"A":50,"T":41,"C":0,"N":8,"G":0},"16":{"A":50,"T":8,"N":0,"C":16,"G":25},"13":{"A":16,"T":50,"N":0,"C":33,"G":0},"6":{"A":25,"T":0,"N":8,"C":8,"G":58},"3":{"A":50,"T":41,"C":0,"N":8,"G":0},"9":{"A":8,"T":25,"N":8,"C":33,"G":25},"12":{"A":41,"T":0,"N":0,"C":8,"G":50},"14":{"A":33,"T":0,"N":0,"C":25,"G":41},"15":{"A":8,"T":50,"N":0,"C":33,"G":8},"8":{"A":58,"T":8,"C":0,"N":8,"G":25},"4":{"A":41,"T":25,"C":16,"N":8,"G":8},"19":{"A":16,"T":75,"N":0,"C":0,"G":8},"10":{"A":16,"T":8,"N":0,"C":0,"G":75},"5":{"A":16,"T":25,"N":8,"C":50,"G":0}}}} \ No newline at end of file diff -r 000000000000 -r 9790cfb46d03 tool-data/example1.html --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool-data/example1.html Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,7035 @@ + + + + +PRINSEQ-graphs Report + + + +

PRINSEQ-graphs v0.6 HTML Report   

[Generated: 03/13/2013 12:12:36]

Input Information
Input file(s): example1.fastq
Input format(s): FASTQ
# Sequences: 12
Total bases: 1,150

Length Distribution
Mean sequence length: 95.83 ± 37.96 bp
Minimum length: 50 bp
Maximum length: 200 bp
Length range: 151 bp
Mode length: 100 bp with 8 sequences
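
The figures above can be cross-checked with a short Python sketch. It assumes the twelve read lengths are eight reads of 100 bp, one of 200 bp and three of 50 bp (inferred from the FASTQ headers in this changeset and the totals reported here), that the ± value is a population standard deviation, and that the length range counts both endpoints. None of this is taken from the PRINSEQ source.

```python
from statistics import mean, pstdev

# Assumed read lengths, inferred from the example data (not read from a file).
lengths = [100] * 8 + [200] + [50] * 3

total_bases = sum(lengths)
mean_len = round(mean(lengths), 2)
std_len = round(pstdev(lengths), 2)  # population standard deviation (assumption)
length_range = max(lengths) - min(lengths) + 1  # inclusive range (assumption)

print(total_bases, mean_len, std_len, length_range)
```

Under these assumptions the values agree with the total bases, mean, standard deviation and length range reported in this section.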


GC Content Distribution
Mean GC content: 48.17 ± 14.83 %
Minimum GC content: 0 %
Maximum GC content: 57 %
GC content range: 58 %
Mode GC content: 50 % with 5 sequences


Base Quality Distribution

Occurrence of N
Sequences with N: 2  (16.67 %)
Max percentage of Ns per sequence: 10 %


Poly-A/T Tails
                       5'-end       3'-end
Sequences with tail:   1 (8.33 %)   1 (8.33 %)
Maximum tail length:   50           50


Tag Sequence Check
                               5'-end   3'-end
Probability of tag sequence:   33 %     33 %
GSMIDs or RLMIDs:              none

[Figure: per-position plot for the 5'- and 3'-sequence ends; x-axis: Position from Sequence Ends (ticks 10, 15, 20 | 20, 15, 10)]

Sequence Duplication
                                             # Sequences   Max duplicates
Exact duplicates:                            4 (33.33 %)   3
Exact duplicates with reverse complements:   0             0
5' duplicates:                               0             0
3' duplicates:                               0             0
5'/3' duplicates with reverse complements:   0             0
Total:                                       4 (33.33 %)   -


Sequence Complexity
                         Value   Sequence
Minimum DUST score:      2       NNNNNNNNNNTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGAGAAGAGGCGTGGAGGAGATGACACACCCCGTGTGTTCTC
Maximum DUST score:      100     AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Minimum Entropy value:   0       AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
Maximum Entropy value:   81      NNNNNNNNNNTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGAGAAGAGGCGTGGAGGAGATGACACACCCCGTGTGTTCTC


Dinucleotide Odds Ratios
             AA/TT    AC/GT    AG/CT    AT       CA/TG    CC/GG    CG       GA/TC    GC       TA
Odds ratio   0.3954   2.4378   0.4858   0.3623   1.0736   0.3006   1.9166   0.4338   0.5829   1.8332

\ No newline at end of file diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_cd.png Binary file tool-data/example1_cd.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_ce.png Binary file tool-data/example1_ce.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_df.png Binary file tool-data/example1_df.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_dl.png Binary file tool-data/example1_dl.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_dm.png Binary file tool-data/example1_dm.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_gc.png Binary file tool-data/example1_gc.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_ld.png Binary file tool-data/example1_ld.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_ns.png Binary file tool-data/example1_ns.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_or.png Binary file tool-data/example1_or.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_pm.png Binary file tool-data/example1_pm.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_pv.png Binary file tool-data/example1_pv.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_qd.png Binary file tool-data/example1_qd.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_qd2.png Binary file tool-data/example1_qd2.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_qd3.png Binary file tool-data/example1_qd3.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_td3.png Binary file tool-data/example1_td3.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_td5.png Binary file tool-data/example1_td5.png has changed diff -r 000000000000 -r 9790cfb46d03 tool-data/example1_trim_right_10.fastq --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ 
b/tool-data/example1_trim_right_10.fastq Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,48 @@ +@seq1 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC ++seq1 length=100 +123412341234213412341234123412341234123412341234123413241234123412341234123412341234123412 +@seq1_dupl1 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC ++seq1_dupl1 length=100 +123412341234213412341234123412341234123412341234123413241234123412341234123412341234123412 +@seq1_dupl2 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC ++seq1_dupl2 length=100 +123412341234213412341234123412341234123412341234123413241234123412341234123412341234123412 +@seq1_dupl3 length=100 +ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTAC ++seq1_dupl3 length=100 +123412341234213412341234123412341234123412341234123413241234123412341234123412341234123412 +@seq2 length=200 +ACGTGACGTGACGTGGTGTACACAGAGATATATGAGACACACAGATAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTACACAGAGATATATGAGACACACAGATAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGT ++seq2 length=200 +++++++++++++++++..............00000000000000000000333333333333333355555555555555555AAAAAAAAAAAAAA999999999999996666666666666666444444444444444442222222222222222222221111111)))))))))))))))>>> +@seq3 length=100 +TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTATAGATTGCGTGCGTACGTG ++seq3 length=100 +FHHFFFFFFDDAA@====AAB===BBBBAAADDDDDDDDAAAADDDDD?????FFFFF??FFFFFFDA@AFFFFFFFFFFFFFFFFCCAB +@seq3_dupl1 length=100 +TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGCTCTGTGCGTGTACGTGACGTGACGTGGTGTATAGATTGCGTGCGTACGTG ++seq3_dupl1 length=100 +FHHFFFFFFDDAA@====AAB===BBBBAAADDDDDDDDAAAADDDDD?????FFFFF??FFFFFFDA@AFFFFFFFFFFFFFFFFCCAB +@seq4 length=50 +TAGATTGCGTGCGTACGTGTGTGCATGCGTTGTGCCGCGC ++seq4 length=50 +???CCDDBBBBBAA333:ABB=:::AGFFFFFHHHHFFFF 
+@seq5 length=100 Ns_begin=10 +NNNNNNNNNNTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGAGAAGAGGCGTGGAGGAGATGACACACCCC ++seq5 length=100 Ns_begin=10 +!!!!!!!!!!99999999999999AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAEEEEEEEEEEEEEEEEEEEEEEEEEEEEEFFFFFF +@seq6 length=100 Ns_end=10 +TACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAAGTGAGAGTTGTACACCAGAGGTGTCTCTGTGTGGGGCCTGTGTGCCAAAA ++seq6 length=100 Ns_end=10 +FFFFFFFFFFFFFFFFFFEEEEEEEEEEEEEEEEEEEAAAAAAAAAAAAAAAAAAAAAAAAA9999999999999999999995555555 +@seq7 length=50 +TACACCAGAGGTGTCTCTGTGTGGGTACACCAGAGGTGTC ++seq7 length=50 +!''*((((***+))%%%++)(%%%%).1***-+*''))** +@seq8 length=50 As=50 +AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA ++seq8 length=50 As=50 +5555555555555555555555555555555555555555 diff -r 000000000000 -r 9790cfb46d03 tool-data/example_readme.txt --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool-data/example_readme.txt Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,36 @@ +This file will contain example commands for newly introduced options. (If you want to copy and paste the commands, do not copy the $-sign.) +For more examples and information, take a look at the Manual (http://prinseq.sourceforge.net/manual.html). + + +(1) Graph data +============== +To generate the data for the graphs shown in the web version or the HTML report, you can use the -graph_data option: +$ perl prinseq-lite.pl -verbose -fastq example1.fastq -graph_data example1.gd -out_good null -out_bad null + +The verbose mode shows the progress, and "-out_good null -out_bad null" prevents PRINSEQ from generating any output files other than the specified example1.gd file containing the graph data. 
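
Judging from the graph data file included in this changeset, the output of -graph_data appears to be a single JSON object. A minimal Python sketch under that assumption follows; it parses a small excerpt of the "dinucodds" values copied from that file rather than reading example1.gd from disk, and the variable names are illustrative only.

```python
import json

# Excerpt copied from the example graph data; the real file has many more keys.
excerpt = (
    '{"dinucodds": {"AATT": "0.395437960", "AT": "0.362329006", '
    '"CG": "1.916620315", "ACGT": "2.437785575"}}'
)

data = json.loads(excerpt)

# The odds ratios are stored as strings, so convert them before comparing.
odds = {pair: float(value) for pair, value in data["dinucodds"].items()}
highest = max(odds, key=odds.get)

print(highest, round(odds[highest], 4))
```

To inspect a real file, the excerpt string could be replaced with the contents of the .gd file, assuming it is valid JSON as the example data suggests.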
+ +To generate the graphs as PNG files, you can use the prinseq-graphs -png_all option: +$ perl prinseq-graphs.pl -i example1.gd -png_all -o example1 + +To generate the HTML report containing all the tables and figures from the web version, you can use the prinseq-graphs -html_all option: +$ perl prinseq-graphs.pl -i example1.gd -html_all -o example1 + + +(2) Consider exact duplicates only +================================== +When you process large files, the duplicate removal will require a lot of memory. To reduce the amount of memory required and speed up the process (at the cost of only removing forward and reverse exact duplicates), you can use the option -exact_only when generating the graph data: +$ perl prinseq-lite.pl -verbose -fastq example1.fastq -graph_data example1.gd -out_good null -out_bad null -exact_only + +Note that for processing the data, if you specify -derep 1, -derep 4, or -derep 14, then the -exact_only option will be used automatically. + + +(3) Duplicate threshold and no quality header information +========================================================= +Process the data (-fastq example1.fastq) with status report (-verbose), remove exact sequence duplicates (-derep 1) that occur more than 2 times (-derep_min 3) and save the sequences passing the filter in example1_good.fastq (-out_good example1_good) without the quality header (-no_qual_header) and the filtered sequences in example1_bad.fastq (-out_bad example1_bad): +$ perl prinseq-lite.pl -verbose -fastq example1.fastq -derep 1 -derep_min 3 -out_good example1_good -out_bad example1_bad -no_qual_header + + +(4) Paired-end data +=================== +Paired-end data is processed similarly to single-read data. The only difference is that two input files are required (either two FASTA or two FASTQ files). The second file is specified either with "-fasta2 file.fa" or "-fastq2 file.fq". 
+ diff -r 000000000000 -r 9790cfb46d03 tool_dependencies.xml --- /dev/null Thu Jan 01 00:00:00 1970 +0000 +++ b/tool_dependencies.xml Mon Oct 07 15:34:32 2013 -0400 @@ -0,0 +1,30 @@ + + + + $REPOSITORY_INSTALL_DIR + + + + + http://downloads.sourceforge.net/project/bowtie-bio/bowtie2/2.1.0/bowtie2-2.1.0-source.zip + perl Makefile.PL PREFIX=$INSTALL_DIR + make + make install + + http://search.cpan.org/CPAN/authors/id/M/MA/MAKAMAKA/JSON-2.59.tar.gz + perl Makefile.PL PREFIX=$INSTALL_DIR + make + make install + + + $INSTALL_DIR/lib/perl5/site_perl + + + + + + + + + +