NAME
    InfoSequenceFiles.pl - List information about sequence and alignment
    files

SYNOPSIS
    InfoSequenceFiles.pl SequenceFile(s) AlignmentFile(s)...

    InfoSequenceFiles.pl [-a, --all] [-c, --count] [-d, --detail infolevel]
    [-f, --frequency] [--FrequencyBins number | "number,number,[number,...]"]
    [-h, --help] [-i, --IgnoreGaps yes | no] [-l, --longest]
    [-s, --shortest] [--SequenceLengths] [-w, --WorkingDir dirname]
    SequenceFile(s)...

DESCRIPTION
    List information about contents of *SequenceFile(s) and
    AlignmentFile(s)*: number of sequences, shortest and longest sequences,
    distribution of sequence lengths and so on. The file names are separated
    by spaces. All the sequence files in the current directory can be
    specified by **.aln*, **.msf*, **.fasta*, **.fta*, **.pir* or any other
    supported format; additionally, *DirName* corresponds to all the
    sequence files in the current directory with any of the supported file
    extensions: *.aln, .msf, .fasta, .fta, and .pir*.

    Supported sequence formats are: *ALN/ClustalW*, *GCG/MSF*, *PILEUP/MSF*,
    *Pearson/FASTA*, and *NBRF/PIR*. Instead of using file extensions, file
    formats are detected by parsing the contents of *SequenceFile(s) and
    AlignmentFile(s)*.
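
    As an illustration of content-based detection, the following Python
    sketch guesses a format from the leading lines of a file. It is not the
    parser used by InfoSequenceFiles.pl; the function name
    guess_sequence_format and the header heuristics are assumptions made
    for this example.

```python
import re


def guess_sequence_format(text: str) -> str:
    """Guess a sequence file format from its contents (illustrative heuristics)."""
    # Work on non-empty, whitespace-trimmed lines only.
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "unknown"
    first = lines[0]

    # ClustalW alignments begin with a header line starting with "CLUSTAL".
    if first.upper().startswith("CLUSTAL"):
        return "ALN/ClustalW"

    # GCG/PILEUP MSF files carry a header line containing "MSF:" that ends
    # with "..", followed later by a "//" divider line.
    has_msf_header = any("MSF:" in line and line.endswith("..") for line in lines)
    if has_msf_header and "//" in lines:
        return "MSF"

    # NBRF/PIR entries start with ">XX;" where XX is a two-character type
    # code such as P1 or DL.
    if re.match(r">[A-Z][A-Z0-9];", first):
        return "NBRF/PIR"

    # Pearson/FASTA entries start with a ">" description line.
    if first.startswith(">"):
        return "Pearson/FASTA"

    return "unknown"
```

    For example, guess_sequence_format(">seq1 test\nACGT\n") is classified
    as Pearson/FASTA, while text starting with ">P1;" is classified as
    NBRF/PIR. The PIR check runs before the FASTA check because both
    formats open with a ">" character.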

OPTIONS
    -a, --all
        List all the available information.

    -c, --count
        List number of sequences. This is the default behavior.

    -d, --detail *InfoLevel*
        Level of information to print about sequences during various
        options. Default: *1*. Possible values: *1, 2 or 3*.

    -f, --frequency
        List distribution of sequence lengths using the specified number of
        bins or the bin range specified using the --FrequencyBins option.

        This option is ignored for input files containing only a single
        sequence.

    --FrequencyBins *number | "number,number,[number,...]"*
        This value is used with the -f, --frequency option to list the
        distribution of sequence lengths using the specified number of bins
        or bin range. Default value: *10*.

        The bin range list is used to group sequence lengths into different
        groups; it must contain values in ascending order. Examples:

            100,200,300,400,500,600
            200,400,600,800,1000

        The frequency value calculated for a specific bin corresponds to all
        the sequence lengths which are greater than the previous bin value
        and less than or equal to the current bin value.
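
        The bin rule above can be sketched in Python as follows. This is a
        minimal illustration, not the script's own code: the function name
        length_frequencies is hypothetical, the first bin is assumed to
        have no lower bound, and lengths above the last bin value are
        assumed to be tallied separately, since this page does not say how
        they are reported.

```python
from bisect import bisect_left


def length_frequencies(lengths, bin_values):
    """Count lengths per bin, where bin i holds previous_bin < length <= bin_values[i]."""
    bin_values = list(bin_values)
    if bin_values != sorted(bin_values):
        raise ValueError("bin values must be in ascending order")

    counts = [0] * len(bin_values)
    overflow = 0  # assumption: lengths above the last bin value are counted here

    for length in lengths:
        # bisect_left returns the index of the first bin value >= length,
        # which is exactly the bin with an inclusive upper bound.
        index = bisect_left(bin_values, length)
        if index < len(bin_values):
            counts[index] += 1
        else:
            overflow += 1
    return counts, overflow


counts, overflow = length_frequencies(
    [120, 150, 151, 200, 349, 350, 400],
    [150, 200, 250, 300, 350],
)
```

        Here a length of 150 falls in the first bin and 151 in the second,
        which matches the "less than or equal to the current bin value"
        wording.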

    -h, --help
        Print this help message.

    -i, --IgnoreGaps *yes | no*
        Ignore gaps during calculation of sequence lengths. Possible values:
        *yes or no*. Default value: *no*.
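
        A minimal Python sketch of what ignoring gaps means for a length
        calculation is shown below. The gap characters "-" and "." are an
        assumption based on common alignment formats, and the function
        name sequence_length is hypothetical; neither is taken from the
        script.

```python
GAP_CHARACTERS = "-."  # assumption: gap symbols commonly used in alignment files


def sequence_length(sequence: str, ignore_gaps: bool = False) -> int:
    """Return the sequence length, optionally skipping gap characters."""
    if not ignore_gaps:
        return len(sequence)
    # Count only the characters that are not gap symbols.
    return sum(1 for residue in sequence if residue not in GAP_CHARACTERS)


aligned = "MK--LV.A"
with_gaps = sequence_length(aligned)
without_gaps = sequence_length(aligned, ignore_gaps=True)
```

        For the aligned sequence above, the length is 8 with gaps and 5
        once the three gap characters are ignored.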

    -l, --longest
        List information about the longest sequence: ID, sequence and
        sequence length. This option is ignored for input files containing
        only a single sequence.

    -s, --shortest
        List information about the shortest sequence: ID, sequence and
        sequence length. This option is ignored for input files containing
        only a single sequence.

    --SequenceLengths
        List information about sequence lengths.

    -w, --WorkingDir *dirname*
        Location of working directory. Default: current directory.

EXAMPLES
    To count the number of sequences in sequence files, type:

        % InfoSequenceFiles.pl Sample1.fasta
        % InfoSequenceFiles.pl Sample1.msf Sample1.aln Sample1.pir
        % InfoSequenceFiles.pl *.fasta *.fta *.msf *.pir *.aln

    To list all available information with the maximum level of detail for
    the sequence alignment file Sample1.msf, type:

        % InfoSequenceFiles.pl -a -d 3 Sample1.msf

    To list sequence length information after ignoring sequence gaps in the
    Sample1.aln file, type:

        % InfoSequenceFiles.pl --SequenceLengths --IgnoreGaps Yes
          Sample1.aln

    To list shortest and longest sequence length information after ignoring
    sequence gaps in the Sample1.aln file, type:

        % InfoSequenceFiles.pl --longest --shortest --IgnoreGaps Yes
          Sample1.aln

    To list the distribution of sequence lengths after ignoring sequence
    gaps in the Sample1.aln file and report the frequency distribution in
    10 bins, type:

        % InfoSequenceFiles.pl --frequency --FrequencyBins 10
          --IgnoreGaps Yes Sample1.aln

    To list the distribution of sequence lengths after ignoring sequence
    gaps in the Sample1.aln file and report the frequency distribution for
    a specified bin range, type:

        % InfoSequenceFiles.pl --frequency --FrequencyBins
          "150,200,250,300,350" --IgnoreGaps Yes Sample1.aln

AUTHOR
    Manish Sud <msud@san.rr.com>

SEE ALSO
    AnalyzeSequenceFilesData.pl, ExtractFromSequenceFiles.pl,
    InfoAminoAcids.pl, InfoNucleicAcids.pl

COPYRIGHT
    Copyright (C) 2015 Manish Sud. All rights reserved.

    This file is part of MayaChemTools.

    MayaChemTools is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 3 of the License, or (at
    your option) any later version.