comparison docs/scripts/txt/AnalyzeSequenceFilesData.txt @ 0:4816e4a8ae95 draft default tip

Uploaded
author deepakjadmin
date Wed, 20 Jan 2016 09:23:18 -0500
-1:000000000000 0:4816e4a8ae95
1 NAME
2 AnalyzeSequenceFilesData.pl - Analyze sequence and alignment files
3
4 SYNOPSIS
5 AnalyzeSequenceFilesData.pl SequenceFile(s) AlignmentFile(s)...
6
7 AnalyzeSequenceFilesData.pl [-h, --help] [-i, --IgnoreGaps yes | no]
8 [-m, --mode PercentIdentityMatrix | ResidueFrequencyAnalysis | All]
9 [--outdelim comma | tab | semicolon] [-o, --overwrite] [-p, --precision
10 number] [-q, --quote yes | no] [--ReferenceSequence SequenceID |
11 UseFirstSequenceID] [--region "StartResNum, EndResNum, [StartResNum,
12 EndResNum...]" | UseCompleteSequence] [--RegionResiduesMode AminoAcids |
13 NucleicAcids | None] [-r, --root rootname] [-w, --WorkingDir dirname]
14 SequenceFile(s) AlignmentFile(s)...
15
16 DESCRIPTION
17 Analyze *SequenceFile(s) and AlignmentFile(s)* data: calculate pairwise
18 percent identity matrix or calculate percent occurrence of various
19 residues in specified sequence regions. All the sequences in an input
20 file must have the same length; otherwise, that sequence file
21 is ignored.
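
The pairwise percent identity calculation described above can be sketched in Python as follows. This is a minimal illustration, not the script's actual implementation: the function names, the gap symbols ("-" and "."), and the choice of denominator (columns where neither sequence has a gap) are all assumptions.

```python
from typing import Dict

GAP_CHARS = {"-", "."}  # assumed gap symbols; the script may recognize others


def percent_identity(seq_a: str, seq_b: str, ignore_gaps: bool = True) -> float:
    """Percent of compared alignment columns where both sequences agree."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = 0
    compared = 0
    for a, b in zip(seq_a, seq_b):
        # Assumption: a column is skipped when either sequence has a gap.
        if ignore_gaps and (a in GAP_CHARS or b in GAP_CHARS):
            continue
        compared += 1
        if a.upper() == b.upper():
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def percent_identity_matrix(seqs: Dict[str, str], precision: int = 2) -> Dict[str, Dict[str, float]]:
    """Build a symmetric ID-by-ID matrix of pairwise percent identities."""
    if len({len(s) for s in seqs.values()}) > 1:
        # Mirrors the documented rule: files with unequal lengths are not analyzed.
        raise ValueError("all sequences must have the same length")
    ids = list(seqs)
    return {
        a: {b: round(percent_identity(seqs[a], seqs[b]), precision) for b in ids}
        for a in ids
    }
```

For three hypothetical aligned sequences "ACDE-F", "ACDQ-F", and "AC-EGF", this sketch gives 80.0 for the first pair, because four of the five gap-free columns match.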
22
23 The file names are separated by spaces. All the sequence files in the
24 current directory can be specified by **.aln*, **.msf*, **.fasta*,
25 **.fta*, **.pir* or any other supported format; additionally, *DirName*
26 corresponds to all the sequence files in the current directory with any
27 of the supported file extensions: *.aln, .msf, .fasta, .fta, and .pir*.
28
29 Supported sequence formats are: *ALN/ClustalW*, *GCG/MSF*, *PILEUP/MSF*,
30 *Pearson/FASTA*, and *NBRF/PIR*. Instead of using file extensions, file
31 formats are detected by parsing the contents of *SequenceFile(s) and
32 AlignmentFile(s)*.
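
Content-based format detection of this kind might look like the Python sketch below. The heuristics are assumptions based on common conventions for these formats (a "CLUSTAL" header line, a ">XX;" PIR header, a ">" FASTA header, and an "MSF:" line followed by a "//" separator); the script's own detection rules are not documented here and may differ. The function name guess_format is hypothetical.

```python
import re
from typing import Optional


def guess_format(text: str) -> Optional[str]:
    """Guess a sequence file format from its contents (heuristic sketch)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if not lines:
        return None
    first = lines[0]
    # ClustalW alignments conventionally start with a "CLUSTAL ..." header.
    if first.upper().startswith("CLUSTAL"):
        return "ALN/ClustalW"
    # NBRF/PIR entries start with ">" plus a two-character type code and ";".
    if re.match(r"^>[A-Z][A-Z0-9];", first):
        return "NBRF/PIR"
    # Any other ">" header line is treated as Pearson/FASTA.
    if first.startswith(">"):
        return "Pearson/FASTA"
    # MSF files carry an "MSF:" header line and a "//" line before the alignment.
    if any("MSF:" in ln for ln in lines) and "//" in lines:
        return "MSF"
    return None
```

The PIR check runs before the FASTA check because both header styles begin with ">".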
33
34 OPTIONS
35 -h, --help
36 Print this help message.
37
38 -i, --IgnoreGaps *yes | no*
39 Ignore gaps during calculation of sequence lengths and specification
40 of regions during residue frequency analysis. Possible values: *yes
41 or no*. Default value: *yes*.
42
43 -m, --mode *PercentIdentityMatrix | ResidueFrequencyAnalysis | All*
44 Specify how to analyze data in sequence files: calculate percent
45 identity matrix or calculate frequency of occurrence of residues in
46 specific regions. For the *ResidueFrequencyAnalysis* value of the -m,
47 --mode option, output files are generated for both the residue count
48 and percent residue count. Possible values: *PercentIdentityMatrix,
49 ResidueFrequencyAnalysis, or All*. Default value:
50 *PercentIdentityMatrix*.
51
52 --outdelim *comma | tab | semicolon*
53 Output text file delimiter. Possible values: *comma, tab, or
54 semicolon*. Default value: *comma*.
55
56 -o, --overwrite
57 Overwrite existing files.
58
59 -p, --precision *number*
60 Precision of calculated values in the output file. Default: up to
61 *2* decimal places. Valid values: positive integers.
62
63 -q, --quote *yes | no*
64 Put quotes around column values in output text file. Possible
65 values: *yes or no*. Default value: *yes*.
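
The combined effect of the --outdelim, -p, --precision, and -q, --quote options on one output row can be sketched in Python as follows. This is an illustration under stated assumptions: the helper name format_row is hypothetical, numeric values are written with a fixed number of decimal places, and quoting uses double quotes; the script's exact formatting may differ.

```python
from typing import Sequence, Union

# Delimiter names taken from the --outdelim option values.
DELIMITERS = {"comma": ",", "tab": "\t", "semicolon": ";"}


def format_row(values: Sequence[Union[str, float]],
               outdelim: str = "comma",
               quote: bool = True,
               precision: int = 2) -> str:
    """Format one row of output text using the documented default options."""
    delim = DELIMITERS[outdelim]
    cells = []
    for value in values:
        # Floats are rounded to the requested precision; other values are kept as text.
        text = f"{value:.{precision}f}" if isinstance(value, float) else str(value)
        cells.append(f'"{text}"' if quote else text)
    return delim.join(cells)
```

With the defaults, format_row(["ACHE_BOVIN", 100.0, 66.6667]) produces a comma-delimited row with every column value quoted.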
66
67 --ReferenceSequence *SequenceID | UseFirstSequenceID*
68 Specify reference sequence ID to identify regions for performing
69 *ResidueFrequencyAnalysis* specified using -m, --mode option.
70 Default: *UseFirstSequenceID*.
71
72 --region *StartResNum,EndResNum,[StartResNum,EndResNum...] |
73 UseCompleteSequence*
74 Specify how to perform frequency of occurrence analysis for
75 residues: use specific regions indicated by starting and ending
76 residue numbers in reference sequence or use the whole reference
77 sequence as one region. Default: *UseCompleteSequence*.
78
79 Based on the value of -i, --IgnoreGaps option, specified residue
80 numbers *StartResNum,EndResNum* correspond to the positions in the
81 reference sequence without gaps or with gaps.
82
83 For residue numbers corresponding to the reference sequence
84 including gaps, percent occurrence of various residues corresponding
85 to gap position in reference sequence is also calculated.
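
How region residue numbers map onto alignment columns under the two -i, --IgnoreGaps settings can be sketched in Python. The function names and the gap symbols are assumptions, and the sketch reflects one reading of the text above: with gaps ignored, residue numbers count only non-gap positions in the reference sequence and gap columns are left out of the region.

```python
from typing import List, Tuple


def parse_regions(spec: str) -> List[Tuple[int, int]]:
    """Split a --region style value such as "5,15,30,40" into (start, end) pairs."""
    nums = [int(part) for part in spec.split(",")]
    if len(nums) % 2 != 0:
        raise ValueError("region values must come in start,end pairs")
    return list(zip(nums[0::2], nums[1::2]))


def region_columns(reference: str, start: int, end: int,
                   ignore_gaps: bool = True, gap_chars: str = "-.") -> List[int]:
    """Return 0-based alignment columns for residue numbers start..end (1-based, inclusive)."""
    if start < 1 or end < start:
        raise ValueError("invalid region")
    if not ignore_gaps:
        # Residue numbers are plain positions in the gapped reference sequence.
        if end > len(reference):
            raise ValueError("region extends past the reference sequence")
        return list(range(start - 1, end))
    columns = []
    residue_num = 0
    for col, ch in enumerate(reference):
        if ch in gap_chars:
            continue  # gap columns do not advance the residue number
        residue_num += 1
        if start <= residue_num <= end:
            columns.append(col)
    if residue_num < end:
        raise ValueError("region extends past the reference sequence")
    return columns
```

For a hypothetical reference "AC--DEF-G", residues 2 to 4 map to columns 1, 4, and 5 when gaps are ignored, and to columns 1, 2, and 3 when they are not.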
86
87 --RegionResiduesMode *AminoAcids | NucleicAcids | None*
88 Specify how to process residues in the regions specified using
89 --region option during *ResidueFrequencyAnalysis* calculation:
90 categorize residues as amino acids, nucleic acids, or simply ignore
91 residue category during the calculation. Possible values:
92 *AminoAcids, NucleicAcids or None*. Default value: *None*.
93
94 For *AminoAcids* or *NucleicAcids* values of --RegionResiduesMode
95 option, all the standard amino acids or nucleic acids are listed in
96 the output file for each region; any gaps and other non-standard
97 residues are added to the list as encountered.
98
99 For *None* value of --RegionResiduesMode option, no assumption is
100 made about the type of residues. Residues and gaps are added to the list
101 as encountered.
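
The difference between the three --RegionResiduesMode values can be sketched in Python as a per-column residue count. The names below are hypothetical, and the use of one-letter codes and of A, C, G, T, U as the standard nucleic acids is an assumption; the script's actual residue lists and labels may differ.

```python
from typing import Dict, Sequence

AMINO_ACIDS = list("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino acids, one-letter codes
NUCLEIC_ACIDS = list("ACGTU")               # assumed set of standard nucleic acids


def residue_counts(sequences: Sequence[str], column: int, mode: str = "None") -> Dict[str, int]:
    """Count residues at one alignment column across all sequences."""
    if mode == "AminoAcids":
        counts = {res: 0 for res in AMINO_ACIDS}    # standard residues listed up front
    elif mode == "NucleicAcids":
        counts = {res: 0 for res in NUCLEIC_ACIDS}
    elif mode == "None":
        counts = {}                                 # no assumption about residue type
    else:
        raise ValueError(f"unknown mode: {mode}")
    for seq in sequences:
        res = seq[column].upper()
        # Gaps and non-standard residues are appended as they are encountered.
        counts[res] = counts.get(res, 0) + 1
    return counts


def percent_counts(counts: Dict[str, int], precision: int = 2) -> Dict[str, float]:
    """Convert raw counts into percent occurrence values."""
    total = sum(counts.values())
    return {res: round(100.0 * n / total, precision) if total else 0.0
            for res, n in counts.items()}
```

Because Python dictionaries keep insertion order, the standard residues appear first in the AminoAcids and NucleicAcids modes, followed by anything else in order of first appearance.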
102
103 -r, --root *rootname*
104 New sequence file name is generated using the root:
105 <Root><Mode>.<Ext> and <Root><Mode><RegionNum>.<Ext>. Default new
106 file name: <SequenceFileName><Mode>.<Ext> for
107 *PercentIdentityMatrix* value of -m, --mode option and
108 <SequenceFileName><Mode><RegionNum>.<Ext> for
109 *ResidueFrequencyAnalysis*. The csv and tsv <Ext> values are used
110 for comma/semicolon and tab delimited text files, respectively. This
111 option is ignored for multiple input files.
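
The naming scheme can be sketched in Python as shown below. The helper name output_file_names is hypothetical, and the "Region<N>" suffix and the separate "Percent" file for residue frequency analysis are inferred from the file names in the EXAMPLES section rather than from a stated rule.

```python
import os
from typing import List, Optional


def output_file_names(sequence_file: str, mode: str, delimiter: str = "comma",
                      root: Optional[str] = None, num_regions: int = 1) -> List[str]:
    """List the output file names expected for one input file (naming sketch)."""
    # Default root is the input file name without its extension.
    base = root if root else os.path.splitext(os.path.basename(sequence_file))[0]
    # csv is used for comma and semicolon delimiters, tsv for tab.
    ext = "tsv" if delimiter == "tab" else "csv"
    names = []
    if mode in ("PercentIdentityMatrix", "All"):
        names.append(f"{base}PercentIdentityMatrix.{ext}")
    if mode in ("ResidueFrequencyAnalysis", "All"):
        for region in range(1, num_regions + 1):
            # One count file and one percent file per region.
            names.append(f"{base}ResidueFrequencyAnalysisRegion{region}.{ext}")
            names.append(f"{base}PercentResidueFrequencyAnalysisRegion{region}.{ext}")
    return names
```

For instance, output_file_names("Sample1.aln", "ResidueFrequencyAnalysis", root="Test") yields the two Test... file names used in the third example of the EXAMPLES section.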
112
113 -w, --WorkingDir *dirname*
114 Location of working directory. Default: current directory.
115
116 EXAMPLES
117 To calculate percent identity matrix for all sequences in Sample1.msf
118 file and generate Sample1PercentIdentityMatrix.csv, type:
119
120 % AnalyzeSequenceFilesData.pl Sample1.msf
121
122 To perform residue frequency analysis for all sequences in Sample1.aln
123 file corresponding to non-gap positions in the first sequence and
124 generate Sample1ResidueFrequencyAnalysisRegion1.csv and
125 Sample1PercentResidueFrequencyAnalysisRegion1.csv files, type:
126
127 % AnalyzeSequenceFilesData.pl -m ResidueFrequencyAnalysis -o
128 Sample1.aln
129
130 To perform residue frequency analysis for all sequences in Sample1.aln
131 file corresponding to all positions in the first sequence and generate
132 TestResidueFrequencyAnalysisRegion1.csv and
133 TestPercentResidueFrequencyAnalysisRegion1.csv files, type:
134
135 % AnalyzeSequenceFilesData.pl -m ResidueFrequencyAnalysis --IgnoreGaps
136 No -o -r Test Sample1.aln
137
138 To perform residue frequency analysis for all sequences in Sample1.msf
139 file corresponding to non-gap residue positions 5 to 15, and 30 to 40 in
140 sequence ACHE_BOVIN and generate
141 Sample1ResidueFrequencyAnalysisRegion1.csv,
142 Sample1ResidueFrequencyAnalysisRegion2.csv,
143 Sample1PercentResidueFrequencyAnalysisRegion1.csv, and
144 Sample1PercentResidueFrequencyAnalysisRegion2.csv files, type:
145
146 % AnalyzeSequenceFilesData.pl -m ResidueFrequencyAnalysis
147 --ReferenceSequence ACHE_BOVIN --region "5,15,30,40" -o Sample1.msf
148
149 AUTHOR
150 Manish Sud <msud@san.rr.com>
151
152 SEE ALSO
153 ExtractFromSequenceFiles.pl, InfoSequenceFiles.pl
154
155 COPYRIGHT
156 Copyright (C) 2015 Manish Sud. All rights reserved.
157
158 This file is part of MayaChemTools.
159
160 MayaChemTools is free software; you can redistribute it and/or modify it
161 under the terms of the GNU Lesser General Public License as published by
162 the Free Software Foundation; either version 3 of the License, or (at
163 your option) any later version.
164