NAME
AnalyzeSequenceFilesData.pl - Analyze sequence and alignment files

SYNOPSIS
AnalyzeSequenceFilesData.pl SequenceFile(s) AlignmentFile(s)...

AnalyzeSequenceFilesData.pl [-h, --help] [-i, --IgnoreGaps yes | no]
[-m, --mode PercentIdentityMatrix | ResidueFrequencyAnalysis | All]
[--outdelim comma | tab | semicolon] [-o, --overwrite] [-p, --precision
number] [-q, --quote yes | no] [--ReferenceSequence SequenceID |
UseFirstSequenceID] [--region "StartResNum, EndResNum, [StartResNum,
EndResNum...]" | UseCompleteSequence] [--RegionResiduesMode AminoAcids |
NucleicAcids | None] [-r, --root rootname] [-w, --WorkingDir dirname]
SequenceFile(s) AlignmentFile(s)...

DESCRIPTION
Analyze *SequenceFile(s) and AlignmentFile(s)* data: calculate a pairwise
percent identity matrix or calculate the percent occurrence of various
residues in specified sequence regions. All the sequences in an input
file must have the same length; otherwise, that file is ignored.

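As a rough illustration of the percent identity calculation described above, the following Python sketch compares equal-length aligned sequences column by column. It is not taken from AnalyzeSequenceFilesData.pl: the function names, the gap symbols, the choice to skip columns where either sequence has a gap, and the value of 100 for a sequence compared with itself are all assumptions made for the example.

```python
from itertools import combinations

GAP_CHARS = "-."  # assumed gap symbols; the script may recognize others


def percent_identity(seq_a, seq_b, ignore_gaps=True):
    """Percent of compared columns in which two aligned sequences match."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = compared = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if ignore_gaps and (a in GAP_CHARS or b in GAP_CHARS):
            continue  # assumption: gapped columns are left out of the count
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def percent_identity_matrix(sequences):
    """Return {(id_a, id_b): percent} for every ordered pair of sequence IDs."""
    ids = list(sequences)
    matrix = {(seq_id, seq_id): 100.0 for seq_id in ids}  # assumed diagonal
    for id_a, id_b in combinations(ids, 2):
        value = percent_identity(sequences[id_a], sequences[id_b])
        matrix[(id_a, id_b)] = matrix[(id_b, id_a)] = value
    return matrix


# Hypothetical three-sequence alignment used only to exercise the functions.
alignment = {"Seq1": "ACDE-FGH", "Seq2": "ACDEKFGA", "Seq3": "AC-EKYGH"}
matrix = percent_identity_matrix(alignment)
```

The matrix is symmetric, so only one triangle is computed and then mirrored. A different gap convention (for example, counting a gap against a residue as a mismatch) would change the denominators and therefore the reported values.
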
The file names are separated by spaces. All the sequence files in the
current directory can be specified by *.aln, *.msf, *.fasta, *.fta,
*.pir, or a wildcard for any other supported format; additionally,
*DirName* corresponds to all the sequence files in the current directory
with any of the supported file extensions: *.aln, .msf, .fasta, .fta,
and .pir*.

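One way to picture the *DirName* expansion is the Python sketch below. It assumes that a directory name stands for the files inside that directory whose names end in a supported extension, and that matching is case-insensitive; both points, along with the function name, are assumptions for illustration rather than documented behavior of the script.

```python
import os

SUPPORTED_EXTENSIONS = (".aln", ".msf", ".fasta", ".fta", ".pir")


def expand_input_names(names):
    """Expand directory names into the sequence files they contain."""
    files = []
    for name in names:
        if os.path.isdir(name):
            # Keep only entries with a supported extension, in sorted order.
            for entry in sorted(os.listdir(name)):
                if entry.lower().endswith(SUPPORTED_EXTENSIONS):
                    files.append(os.path.join(name, entry))
        else:
            files.append(name)  # plain file names are passed through as-is
    return files
```

Wildcards such as *.aln are normally expanded by the shell before the script sees them, so the sketch only handles directory names and literal file names.
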
Supported sequence formats are: *ALN/ClustalW*, *GCG/MSF*, *PILEUP/MSF*,
*Pearson/FASTA*, and *NBRF/PIR*. File formats are detected by parsing the
contents of *SequenceFile(s) and AlignmentFile(s)* rather than by relying
on file extensions.

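Content-based format detection of the kind described above can be sketched in Python as follows. The heuristics rely on commonly seen markers of each format (a CLUSTAL header line, a ">" header, the ">P1;"-style type code of PIR entries, and the "MSF: ... .." header line of MSF files); they are assumptions for illustration and not the script's actual detection rules, and the function name is hypothetical.

```python
def guess_sequence_format(text):
    """Guess an alignment or sequence file format from its contents."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "Unknown"
    first = lines[0]
    if first.upper().startswith("CLUSTAL"):
        return "ALN/ClustalW"
    if first.startswith(">"):
        # PIR entries begin with a two-character type code and a semicolon,
        # for example ">P1;" for a complete protein.
        if len(first) > 3 and first[3] == ";":
            return "NBRF/PIR"
        return "Pearson/FASTA"
    # MSF headers contain a line with "MSF:" that ends with a ".." divider.
    if any("MSF:" in line and line.endswith("..") for line in lines):
        return "MSF"
    return "Unknown"
```

The sketch does not distinguish *GCG/MSF* from *PILEUP/MSF*, and a FASTA header that happens to have a semicolon as its fourth character would be misread as PIR; a real parser would validate more of the file.
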
OPTIONS
-h, --help
    Print this help message.

-i, --IgnoreGaps *yes | no*
    Ignore gaps when calculating sequence lengths and when specifying
    regions for residue frequency analysis. Possible values: *yes
    or no*. Default value: *yes*.

-m, --mode *PercentIdentityMatrix | ResidueFrequencyAnalysis | All*
    Specify how to analyze data in sequence files: calculate a percent
    identity matrix or calculate the frequency of occurrence of residues
    in specific regions. For the *ResidueFrequencyAnalysis* value of the
    -m, --mode option, output files are generated for both the residue
    count and the percent residue count. Possible values:
    *PercentIdentityMatrix, ResidueFrequencyAnalysis, or All*. Default
    value: *PercentIdentityMatrix*.

--outdelim *comma | tab | semicolon*
    Output text file delimiter. Possible values: *comma, tab, or
    semicolon*. Default value: *comma*.

-o, --overwrite
    Overwrite existing files.

-p, --precision *number*
    Precision of calculated values in the output file. Default: up to
    *2* decimal places. Valid values: positive integers.

-q, --quote *yes | no*
    Put quotes around column values in the output text file. Possible
    values: *yes or no*. Default value: *yes*.

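The combined effect of the --outdelim, -p, --precision, and -q, --quote options on an output line can be approximated with Python's standard csv module. This is a sketch under assumptions: the function name and the row layout are hypothetical, floats are always written with exactly `precision` decimal places, and a quote value of "no" is mapped to minimal quoting.

```python
import csv
import io

DELIMITERS = {"comma": ",", "tab": "\t", "semicolon": ";"}


def format_rows(rows, outdelim="comma", precision=2, quote="yes"):
    """Render rows as delimited text, formatting floats to `precision` places."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=DELIMITERS[outdelim],
        # "yes" quotes every value; "no" quotes only when a value requires it.
        quoting=csv.QUOTE_ALL if quote == "yes" else csv.QUOTE_MINIMAL,
        lineterminator="\n",
    )
    for row in rows:
        writer.writerow(
            [f"{v:.{precision}f}" if isinstance(v, float) else v for v in row]
        )
    return buffer.getvalue()


# Hypothetical matrix rows: a header row followed by one data row.
text = format_rows([["ID", "Seq1"], ["Seq1", 100.0]])
```

With the defaults, `text` holds two lines in which every value is wrapped in double quotes and separated by commas.
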
--ReferenceSequence *SequenceID | UseFirstSequenceID*
    Specify the ID of the reference sequence used to identify regions
    for the *ResidueFrequencyAnalysis* selected with the -m, --mode
    option. Default: *UseFirstSequenceID*.

--region *StartResNum,EndResNum,[StartResNum,EndResNum...] |
UseCompleteSequence*
    Specify how to perform frequency of occurrence analysis for
    residues: use specific regions indicated by starting and ending
    residue numbers in the reference sequence, or use the whole
    reference sequence as one region. Default: *UseCompleteSequence*.

    Depending on the value of the -i, --IgnoreGaps option, the specified
    residue numbers *StartResNum,EndResNum* correspond to positions in
    the reference sequence without gaps or with gaps.

    For residue numbers corresponding to the reference sequence
    including gaps, the percent occurrence of various residues at gap
    positions in the reference sequence is also calculated.

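The relationship between a --region value and alignment columns can be sketched in Python as shown here. The helper names, the gap symbols, and the decision to skip reference gap columns when gaps are ignored are assumptions for illustration; the script's own handling may differ.

```python
GAP_CHARS = "-."  # assumed gap symbols


def parse_regions(region_spec):
    """Turn a string such as "5,15,30,40" into [(5, 15), (30, 40)]."""
    numbers = [int(n) for n in region_spec.split(",")]
    if len(numbers) % 2 or any(n < 1 for n in numbers):
        raise ValueError("regions need positive start,end pairs")
    pairs = list(zip(numbers[0::2], numbers[1::2]))
    if any(start > end for start, end in pairs):
        raise ValueError("start must not exceed end")
    return pairs


def region_columns(reference, start, end, ignore_gaps=True):
    """Return 0-based alignment columns covered by residues start..end."""
    if not ignore_gaps:
        # Numbers refer directly to alignment positions, gaps included.
        return list(range(start - 1, min(end, len(reference))))
    columns = []
    residue_number = 0
    for column, symbol in enumerate(reference):
        if symbol in GAP_CHARS:
            continue  # gap columns do not advance the residue numbering
        residue_number += 1
        if start <= residue_number <= end:
            columns.append(column)
    return columns
```

For a hypothetical reference "AC--DEFG-H", residues 2 to 4 map to columns 1, 4, and 5 when gaps are ignored, and to columns 1, 2, and 3 when residue numbers include gaps.
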
--RegionResiduesMode *AminoAcids | NucleicAcids | None*
    Specify how to process residues in the regions specified using the
    --region option during *ResidueFrequencyAnalysis* calculation:
    categorize residues as amino acids or nucleic acids, or ignore the
    residue category during the calculation. Possible values:
    *AminoAcids, NucleicAcids, or None*. Default value: *None*.

    For the *AminoAcids* or *NucleicAcids* values of the
    --RegionResiduesMode option, all the standard amino acids or nucleic
    acids are listed in the output file for each region; any gaps and
    other nonstandard residues are added to the list as encountered.

    For the *None* value of the --RegionResiduesMode option, no
    assumption is made about the type of residues. Residues and gaps are
    added to the list as encountered.

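The listing behavior of the three --RegionResiduesMode values can be mimicked with an ordered dictionary in Python. This sketch makes several assumptions: the standard residue alphabets and their order, the pooling of all columns of a region into a single tally, and the function name are illustrative choices, not details confirmed by this documentation.

```python
STANDARD_RESIDUES = {
    "AminoAcids": list("ACDEFGHIKLMNPQRSTVWY"),  # assumed order
    "NucleicAcids": list("ACGTU"),               # assumed alphabet
    "None": [],
}


def residue_frequencies(sequences, columns, mode="None"):
    """Count residues over the given alignment columns across all sequences.

    Returns (counts, percents), both keyed by residue symbol in listing order.
    """
    # Standard residues are listed first, even when their count stays at zero.
    counts = {residue: 0 for residue in STANDARD_RESIDUES[mode]}
    for sequence in sequences:
        for column in columns:
            symbol = sequence[column].upper()
            # Gaps and other symbols are appended to the list as encountered.
            counts[symbol] = counts.get(symbol, 0) + 1
    total = sum(counts.values())
    percents = {r: (100.0 * n / total if total else 0.0) for r, n in counts.items()}
    return counts, percents
```

Because Python dictionaries preserve insertion order, the standard residues appear first and anything else follows in the order it was first seen, which matches the "added to the list as encountered" wording.
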
-r, --root *rootname*
    New file names are generated using the root: <Root><Mode>.<Ext> and
    <Root><Mode><RegionNum>.<Ext>. Default new file name:
    <SequenceFileName><Mode>.<Ext> for the *PercentIdentityMatrix* value
    of the -m, --mode option and
    <SequenceFileName><Mode><RegionNum>.<Ext> for
    *ResidueFrequencyAnalysis*. The csv and tsv <Ext> values are used
    for comma/semicolon and tab delimited text files, respectively. This
    option is ignored for multiple input files.

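The naming scheme can be made concrete with a short Python sketch. The "Region<N>" suffix and the "Percent" prefix for percent residue count files are inferred from the file names shown in the EXAMPLES section; the function name and its parameters are hypothetical, and the *All* mode is left out for brevity.

```python
import os

EXTENSIONS = {"comma": "csv", "semicolon": "csv", "tab": "tsv"}


def output_file_names(sequence_file, mode, region_count=1, root=None,
                      outdelim="comma"):
    """List the output file names expected for one input file."""
    # Without an explicit root, fall back to the input file name sans extension.
    root = root or os.path.splitext(os.path.basename(sequence_file))[0]
    ext = EXTENSIONS[outdelim]
    if mode == "PercentIdentityMatrix":
        return [f"{root}{mode}.{ext}"]
    if mode != "ResidueFrequencyAnalysis":
        raise ValueError("sketch covers only the two single-analysis modes")
    names = []
    for prefix in ("", "Percent"):  # count files first, then percent files
        for region in range(1, region_count + 1):
            names.append(f"{root}{prefix}{mode}Region{region}.{ext}")
    return names
```

For instance, calling it with "Sample1.aln", the ResidueFrequencyAnalysis mode, and a root of "Test" yields TestResidueFrequencyAnalysisRegion1.csv and TestPercentResidueFrequencyAnalysisRegion1.csv.
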
-w, --WorkingDir *dirname*
    Location of the working directory. Default: current directory.

EXAMPLES
To calculate the percent identity matrix for all sequences in the
Sample1.msf file and generate Sample1PercentIdentityMatrix.csv, type:

    % AnalyzeSequenceFilesData.pl Sample1.msf

To perform residue frequency analysis for all sequences in the
Sample1.aln file corresponding to non-gap positions in the first
sequence and generate Sample1ResidueFrequencyAnalysisRegion1.csv and
Sample1PercentResidueFrequencyAnalysisRegion1.csv files, type:

    % AnalyzeSequenceFilesData.pl -m ResidueFrequencyAnalysis -o
      Sample1.aln

To perform residue frequency analysis for all sequences in the
Sample1.aln file corresponding to all positions in the first sequence
and generate TestResidueFrequencyAnalysisRegion1.csv and
TestPercentResidueFrequencyAnalysisRegion1.csv files, type:

    % AnalyzeSequenceFilesData.pl -m ResidueFrequencyAnalysis --IgnoreGaps
      No -o -r Test Sample1.aln

To perform residue frequency analysis for all sequences in the
Sample1.msf file corresponding to non-gap residue positions 5 to 15 and
30 to 40 in sequence ACHE_BOVIN and generate
Sample1ResidueFrequencyAnalysisRegion1.csv,
Sample1ResidueFrequencyAnalysisRegion2.csv,
Sample1PercentResidueFrequencyAnalysisRegion1.csv, and
Sample1PercentResidueFrequencyAnalysisRegion2.csv files, type:

    % AnalyzeSequenceFilesData.pl -m ResidueFrequencyAnalysis
      --ReferenceSequence ACHE_BOVIN --region "5,15,30,40" -o Sample1.msf

AUTHOR
Manish Sud <msud@san.rr.com>

SEE ALSO
ExtractFromSequenceFiles.pl, InfoSequenceFiles.pl

COPYRIGHT
Copyright (C) 2015 Manish Sud. All rights reserved.

This file is part of MayaChemTools.

MayaChemTools is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at
your option) any later version.