NAME
    InfoSequenceFiles.pl - List information about sequence and alignment
    files

SYNOPSIS
    InfoSequenceFiles.pl SequenceFile(s) AlignmentFile(s)...

    InfoSequenceFiles.pl [-a, --all] [-c, --count] [-d, --detail infolevel]
    [-f, --frequency] [--FrequencyBins number | "number, number,
    [number,...]"] [-h, --help] [-i, --IgnoreGaps yes | no] [-l, --longest]
    [-s, --shortest] [--SequenceLengths] [-w, --workingdir dirname]
    SequenceFile(s)...

DESCRIPTION
    List information about contents of *SequenceFile(s) and
    AlignmentFile(s)*: number of sequences, shortest and longest sequences,
    distribution of sequence lengths and so on. The file names are separated
    by spaces. All the sequence files in the current directory can be
    specified by **.aln*, **.msf*, **.fasta*, **.fta*, **.pir* or any other
    supported format; additionally, *DirName* corresponds to all the
    sequence files in the current directory with any of the supported file
    extensions: *.aln, .msf, .fasta, .fta, and .pir*.

    Supported sequence formats are: *ALN/ClustalW*, *GCG/MSF*, *PILEUP/MSF*,
    *Pearson/FASTA*, and *NBRF/PIR*. Instead of using file extensions, file
    formats are detected by parsing the contents of *SequenceFile(s) and
    AlignmentFile(s)*.
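
    As a rough illustration of what content-based detection can look like,
    the Python sketch below guesses a format from the first lines of a
    file. It is not the script's actual logic: the helper name
    guess_sequence_format and the signatures it checks (a "CLUSTAL" header,
    a ">XX;" PIR record start, a ">" FASTA record start, an "MSF:" header
    line) are assumptions based on how these formats commonly begin.

```python
import re


def guess_sequence_format(text):
    """Guess a sequence file format from its contents (hypothetical helper)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    first = lines[0]
    # Assumption: ClustalW alignments start with a "CLUSTAL ..." header line.
    if first.upper().startswith("CLUSTAL"):
        return "ALN/ClustalW"
    # Assumption: NBRF/PIR records start with ">", a two-character type
    # code and ";" (for example ">P1;"), which separates them from FASTA.
    if re.match(r"^>[A-Za-z0-9]{2};", first):
        return "NBRF/PIR"
    if first.startswith(">"):
        return "Pearson/FASTA"
    # Assumption: GCG and PILEUP MSF files carry a header line with "MSF:".
    if any("MSF:" in line for line in lines[:50]):
        return "MSF"
    return "unknown"
```

    The order of the checks matters in this sketch: the PIR test runs
    before the FASTA test because both record types begin with ">".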

OPTIONS
    -a, --all
        List all the available information.

    -c, --count
        List number of sequences. This is the default behavior.

    -d, --detail *InfoLevel*
        Level of information to print about sequences during various
        options. Default: *1*. Possible values: *1, 2 or 3*.

    -f, --frequency
        List distribution of sequence lengths using the specified number of
        bins or bin range specified using the FrequencyBins option.

        This option is ignored for input files containing only a single
        sequence.

    --FrequencyBins *number | "number,number,[number,...]"*
        This value is used with the -f, --frequency option to list the
        distribution of sequence lengths using the specified number of bins
        or bin range. Default value: *10*.

        The bin range list is used to group sequence lengths into different
        groups; it must contain values in ascending order. Examples:

            100,200,300,400,500,600
            200,400,600,800,1000

        The frequency value calculated for a specific bin corresponds to all
        the sequence lengths which are greater than the previous bin value
        and less than or equal to the current bin value.

    -h, --help
        Print this help message.

    -i, --IgnoreGaps *yes | no*
        Ignore gaps during calculation of sequence lengths. Possible values:
        *yes or no*. Default value: *no*.

    -l, --longest
        List information about the longest sequence: ID, sequence and
        sequence length. This option is ignored for input files containing
        only a single sequence.

    -s, --shortest
        List information about the shortest sequence: ID, sequence and
        sequence length. This option is ignored for input files containing
        only a single sequence.

    --SequenceLengths
        List information about sequence lengths.

    -w, --WorkingDir *dirname*
        Location of working directory. Default: current directory.

EXAMPLES
    To count the number of sequences in sequence files, type:

        % InfoSequenceFiles.pl Sample1.fasta
        % InfoSequenceFiles.pl Sample1.msf Sample1.aln Sample1.pir
        % InfoSequenceFiles.pl *.fasta *.fta *.msf *.pir *.aln

    To list all available information at the maximum level of detail for
    the sequence alignment file Sample1.msf, type:

        % InfoSequenceFiles.pl -a -d 3 Sample1.msf

    To list sequence length information after ignoring sequence gaps in the
    Sample1.aln file, type:

        % InfoSequenceFiles.pl --SequenceLengths --IgnoreGaps Yes
          Sample1.aln
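
    What ignoring gaps means for a length calculation can be sketched in
    Python as follows. This is an illustration only, not the script's
    implementation: the helper name sequence_length is hypothetical, and
    treating "-" and "." as the gap characters is an assumption.

```python
# Assumption: gaps in aligned sequences are written as '-' or '.'.
GAP_CHARS = frozenset("-.")


def sequence_length(sequence, ignore_gaps=False):
    """Return the sequence length, optionally skipping gap characters."""
    if ignore_gaps:
        # Count only the characters that are not gap symbols.
        return sum(1 for residue in sequence if residue not in GAP_CHARS)
    return len(sequence)
```

    Under these assumptions, the aligned string "AC--GT.A" has length 8
    when gaps are counted and length 5 when they are ignored.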

    To list shortest and longest sequence length information after ignoring
    sequence gaps in the Sample1.aln file, type:

        % InfoSequenceFiles.pl --longest --shortest --IgnoreGaps Yes
          Sample1.aln
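
    A minimal Python sketch of picking the shortest and longest sequences
    after ignoring gaps is given below. It is illustrative rather than the
    script's own code: the function name shortest_and_longest, the mapping
    of sequence IDs to sequence strings, and the gap characters "-" and "."
    are all assumptions.

```python
def shortest_and_longest(sequences, ignore_gaps=False, gap_chars="-."):
    """Return ((id, length), (id, length)) for the shortest and longest
    sequences, or None when fewer than two sequences are given.

    `sequences` is assumed to map sequence IDs to sequence strings.
    """
    if len(sequences) < 2:
        # Mirrors the documented behavior: these options are ignored for
        # input containing only a single sequence.
        return None

    def length(seq):
        if ignore_gaps:
            return sum(1 for residue in seq if residue not in gap_chars)
        return len(seq)

    lengths = {seq_id: length(seq) for seq_id, seq in sequences.items()}
    shortest_id = min(lengths, key=lengths.get)
    longest_id = max(lengths, key=lengths.get)
    return (shortest_id, lengths[shortest_id]), (longest_id, lengths[longest_id])
```

    When several sequences tie on length, this sketch reports the first
    one encountered; how the script itself breaks ties is not stated here.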

    To list the distribution of sequence lengths after ignoring sequence
    gaps in the Sample1.aln file and report the frequency distribution in
    10 bins, type:

        % InfoSequenceFiles.pl --frequency --FrequencyBins 10
          --IgnoreGaps Yes Sample1.aln

    To list the distribution of sequence lengths after ignoring sequence
    gaps in the Sample1.aln file and report the frequency distribution for
    a specified bin range, type:

        % InfoSequenceFiles.pl --frequency --FrequencyBins
          "150,200,250,300,350" --IgnoreGaps Yes Sample1.aln
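
    The bin rule described under --FrequencyBins (a length belongs to a bin
    when it is greater than the previous bin value and less than or equal
    to the current bin value) can be sketched in Python as shown below. The
    function name bin_sequence_lengths is hypothetical. Two further
    assumptions are made: the first bin has no lower bound, and lengths
    above the last bin value are tallied separately.

```python
from bisect import bisect_left


def bin_sequence_lengths(lengths, bins):
    """Count lengths per bin, where bin i holds bins[i-1] < length <= bins[i].

    Returns (counts, above): one count per bin, plus the number of lengths
    greater than the last bin value (an assumption of this sketch).
    """
    bins = list(bins)
    if bins != sorted(bins):
        raise ValueError("bin values must be in ascending order")
    counts = [0] * len(bins)
    above = 0
    for length in lengths:
        # bisect_left finds the first bin value that is >= length.
        index = bisect_left(bins, length)
        if index < len(bins):
            counts[index] += 1
        else:
            above += 1
    return counts, above
```

    With the bin list 150,200,250,300,350 from the example above, a length
    of 200 is counted in the 200 bin and a length of 201 in the 250 bin.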

AUTHOR
    Manish Sud <msud@san.rr.com>

SEE ALSO
    AnalyzeSequenceFilesData.pl, ExtractFromSequenceFiles.pl,
    InfoAminoAcids.pl, InfoNucleicAcids.pl

COPYRIGHT
    Copyright (C) 2015 Manish Sud. All rights reserved.

    This file is part of MayaChemTools.

    MayaChemTools is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 3 of the License, or (at
    your option) any later version.