comparison docs/scripts/txt/AnalyzeTextFilesData.txt @ 0:4816e4a8ae95 draft default tip

Uploaded
author deepakjadmin
date Wed, 20 Jan 2016 09:23:18 -0500
-1:000000000000 0:4816e4a8ae95
1 NAME
2 AnalyzeTextFilesData.pl - Analyze numerical column data in TextFile(s)
3
4 SYNOPSIS
5 AnalyzeTextFilesData.pl TextFile(s)...
6
7 AnalyzeTextFilesData.pl [-c, --colmode colnum | collabel] [--columns
8 "colnum,[colnum,...]" | "collabel,[collabel,...]" | All] [--columnpairs
9 "colnum,colnum,[colnum,colnum]..." |
10 "collabel,collabel,[collabel,collabel]..." | AllPairs] [-d, --detail
11 infolevel] [-f, --fast] [--frequencybins number |
12 "number,number,[number,...]"] [-h, --help] [--indelim comma | semicolon]
13 [--klargest number] [--ksmallest number] [-m, --mode
14 DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All |
15 "function1, [function2,...]"] [-o, --overwrite] [--outdelim comma | tab
16 | semicolon] [-p, --precision number] [-q, --quote yes | no] [-r, --root
17 rootname] [--trimfraction number] [-w, --workingdir dirname]
18 TextFile(s)...
19
20 DESCRIPTION
21 Analyze numerical column data in *TextFile(s)* using a combination of
22 various statistical functions. Non-numerical values are simply ignored.
23 For *Correlation, RSquare, and Covariance* analysis, the count of valid
24 values in the specified column pair must be the same; otherwise, the column
25 pair is ignored. The file names are separated by spaces. The valid file
26 extensions are *.csv* and *.tsv* for comma/semicolon and tab delimited
27 text files respectively. All other file names are ignored. All the text
28 files in the current directory can be specified by **.csv*, **.tsv*, or
29 the current directory name. The --indelim option determines the format
30 of *TextFile(s)*. Any file which doesn't correspond to the format
31 indicated by the --indelim option is ignored.
32
33 OPTIONS
34 -c, --colmode *colnum | collabel*
35 Specify how columns are identified in TextFile(s): using column
36 number or column label. Possible values: *colnum or collabel*.
37 Default value: *colnum*.
38
39 --columns *"colnum,[colnum,...]" | "collabel,[collabel]..." | All*
40 This value is mode specific. It's a list of comma delimited columns
41 to use for data analysis. Default value: *First column*.
42
43 This value is ignored during *Correlation/Pearson Correlation* and
44 *Covariance* data analysis; the --columnpairs option is used instead.
45
46 For *colnum* value of -c, --colmode option, input values format is:
47 *colnum,colnum,...*. Example:
48
49 1,3,5
50
51 For *collabel* value of -c, --colmode option, input values format
52 is: *collabel,collabel,..*. Example:
53
54 ALogP,MolWeight,EC50
55
56 --columnpairs *"colnum,colnum,[colnum,colnum,...]" |
57 "collabel,collabel,[collabel,collabel,...]" | AllPairs*
58 This value is mode specific and is only used for *Correlation,
59 PearsonCorrelation, or Covariance* value of -m, --mode option. It is
60 a comma delimited list of column pairs to use for data analysis
61 during *Correlation* and *Covariance* calculations. Default value:
62 *First column, Second column*.
63
64 For *colnum* value of -c, --colmode option, input values format is:
65 *colnum,colnum,[colnum,colnum]...*. Example:
66
67 1,3,5,6,1,6
68
69 For *collabel* value of -c, --colmode option, input values format
70 is: *collabel,collabel,[collabel,collabel]..*. Example:
71
72 MolWeight,EC50,NumN+O,PSA
73
74 For *AllPairs* value of --columnpairs option, all column pairs are
75 used for *Correlation* and *Covariance* calculations.
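
To illustrate how a flat --columnpairs list maps onto pairs, here is a minimal Python sketch. It is not the script's own code: the helper name parse_column_pairs is hypothetical, and treating an odd number of entries as an error is an assumption.

```python
def parse_column_pairs(spec):
    """Split a flat 'a,b,c,d' list into [(a, b), (c, d)] pairs."""
    items = [item.strip() for item in spec.split(",")]
    if len(items) % 2 != 0:
        # Assumption: an unpaired trailing column is treated as an error.
        raise ValueError("column pairs list must contain an even number of entries")
    # Pair up entries 0-1, 2-3, 4-5, ...
    return list(zip(items[0::2], items[1::2]))


print(parse_column_pairs("1,3,5,6,1,6"))
print(parse_column_pairs("MolWeight,EC50,NumN+O,PSA"))
```

With the two example lists above, the first call yields the pairs (1,3), (5,6), and (1,6); the second yields (MolWeight,EC50) and (NumN+O,PSA).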
76
77 -d, --detail *infolevel*
78 Level of information to print about column values being ignored.
79 Default: *1*. Possible values: 1, 2, 3, or 4.
80
81 -f, --fast
82 In this mode, all the columns specified for analysis are assumed to
83 contain numerical data and no checking is performed before analysis.
84 By default, only numerical data is used for analysis; other types of
85 column data are ignored.
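
The default behavior (checking each value and ignoring non-numerical ones) can be pictured with the following Python sketch. The function name numeric_values is hypothetical, and using float() as the test for "numerical" is an assumption; the script's own check may accept or reject different strings.

```python
def numeric_values(column_values):
    """Separate a column's raw strings into numeric values and ignored entries."""
    kept, ignored = [], []
    for raw in column_values:
        try:
            # Assumption: anything float() can parse counts as numerical.
            kept.append(float(raw))
        except ValueError:
            ignored.append(raw)
    return kept, ignored


kept, ignored = numeric_values(["1.5", "N/A", "", "42", "3e2"])
print(kept)     # [1.5, 42.0, 300.0]
print(ignored)  # ['N/A', '']
```

Skipping this check, as -f, --fast does, saves one pass over the data but is only safe when every selected column is known to be clean.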
86
87 --frequencybins *number | "number,number,[number,...]"*
88 Specify number of bins or bin range to use for frequency analysis.
89 Default value: *10*
90
91 Number of bins value along with the smallest and largest value for a
92 column is used to group the column values into different groups.
93
94 The bin range list is used to group values for a column into
95 different groups; it must contain values in ascending order.
96 Examples:
97
98 10,20,30
99 0.1,0.2,0.3,0.4,0.5
100
101 The frequency value calculated for a specific bin corresponds to all
102 the column values which are greater than the previous bin value and
103 less than or equal to the current bin value.
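
The bin range rule above (greater than the previous bin value, less than or equal to the current one) can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and two details are assumptions not stated in this document, namely that the first bin collects every value up to its limit and that values above the last limit are left uncounted.

```python
from bisect import bisect_left


def frequency_by_bin_range(values, bin_limits):
    """Count values with previous_limit < value <= current_limit per limit."""
    counts = [0] * len(bin_limits)
    for value in values:
        # bisect_left returns the index of the first limit that is >= value,
        # which is exactly the bin whose upper limit covers the value.
        index = bisect_left(bin_limits, value)
        if index < len(bin_limits):
            counts[index] += 1
        # Assumption: values above the last limit fall outside all bins.
    return counts


print(frequency_by_bin_range([5, 10, 12, 20, 25, 30, 31], [10, 20, 30]))  # [2, 2, 2]
```

In this run, 10 lands in the first bin and 20 in the second because the upper limit is inclusive, while 31 exceeds the last limit and is not counted.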
104
105 -h, --help
106 Print this help message.
107
108 --indelim *comma | semicolon*
109 Input delimiter for CSV *TextFile(s)*. Possible values: *comma or
110 semicolon*. Default value: *comma*. For TSV files, this option is
111 ignored and *tab* is used as a delimiter.
112
113 --klargest *number*
114 Kth largest value to find by *KLargest* function. Default value: *2*.
115 Valid values: positive integers.
116
117 --ksmallest *number*
118 Kth smallest value to find by *KSmallest* function. Default value:
119 *2*. Valid values: positive integers.
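
As a rough illustration of what *KLargest* and *KSmallest* return, here is a short Python sketch. The function names are hypothetical, and counting duplicate values as separate positions is an assumption.

```python
def k_largest(values, k=2):
    """Return the k-th largest value (k=1 is the maximum)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    return sorted(values, reverse=True)[k - 1]


def k_smallest(values, k=2):
    """Return the k-th smallest value (k=1 is the minimum)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    return sorted(values)[k - 1]


data = [4, 9, 1, 7]
print(k_largest(data))   # 7
print(k_smallest(data))  # 4
```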
120
121 -m, --mode *DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All
122 | "function1, [function2,...]"*
123 Specify how to analyze data in TextFile(s): calculate basic or all
124 descriptive statistics; or use a comma delimited list of supported
125 statistical functions. Possible values: *DescriptiveStatisticsBasic
126 | DescriptiveStatisticsAll | All | "function1,[function2]..."*. Default
127 value: *DescriptiveStatisticsBasic*.
128
129 *DescriptiveStatisticsBasic* includes these functions: *Count,
130 Maximum, Minimum, Mean, Median, Sum, StandardDeviation,
131 StandardError, Variance*.
132
133 *DescriptiveStatisticsAll*, in addition to
134 *DescriptiveStatisticsBasic* functions, includes: *GeometricMean,
135 Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Mode,
136 RSquare, Skewness, TrimMean*.
137
138 *All* uses complete list of supported functions: *Average,
139 AverageDeviation, Correlation, Count, Covariance, GeometricMean,
140 Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Maximum,
141 Minimum, Mean, Median, Mode, RSquare, Skewness, Sum, SumOfSquares,
142 StandardDeviation, StandardDeviationN, StandardError,
143 StandardScores, StandardScoresN, TrimMean, Variance, VarianceN*. The
144 function names ending with N calculate corresponding values assuming
145 an entire population instead of a population sample.
146
147 Here are the formulas for these functions:
148
149 Average: See Mean
150
151 AverageDeviation: SUM( ABS(x[i] - Xmean) ) / n
152
153 Correlation: See PearsonCorrelation
154
155 Covariance: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / n
156
157 GeometricMean: NthROOT( PRODUCT(x[i]) )
158
159 HarmonicMean: 1 / ( SUM(1/x[i]) / n )
160
161 Mean: SUM( x[i] ) / n
162
163 Median: Xsorted[(n - 1)/2 + 1] for odd values of n; (Xsorted[n/2] +
164 Xsorted[n/2 + 1])/2 for even values of n.
165
166 Kurtosis: [ {n(n + 1)/(n - 1)(n - 2)(n - 3)} SUM{ ((x[i] -
167 Xmean)/STDDEV)^4 } ] - {3((n - 1)^2)}/{(n - 2)(n-3)}
168
169 PearsonCorrelation: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / SQRT( SUM(
170 (x[i] - Xmean)^2 ) (SUM( (y[i] - Ymean)^2 )) )
171
172 RSquare: PearsonCorrelation^2
173
174 Skewness: {n/(n - 1)(n - 2)} SUM{ ((x[i] - Xmean)/STDDEV)^3 }
175
176 StandardDeviation: SQRT ( SUM( (x[i] - Mean)^2 ) / (n - 1) )
177
178 StandardDeviationN: SQRT ( SUM( (x[i] - Mean)^2 ) / n )
179
180 StandardError: StandardDeviation / SQRT( n )
181
182 StandardScores: (x[i] - Mean) / StandardDeviation
183
184 StandardScoresN: (x[i] - Mean) / StandardDeviationN
185
186 Variance: SUM( (x[i] - Xmean)^2 ) / (n - 1)
187
188 VarianceN: SUM( (x[i] - Xmean)^2 ) / n
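
For readers who want to check a few of these formulas by hand, the following Python sketch implements a subset of them. It is an independent illustration, not the script's implementation: all function names are hypothetical, the median uses the usual definition (middle element for odd n, mean of the two middle elements for even n), and the covariance divides by n as listed above.

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def median(xs):
    s = sorted(xs)
    n = len(s)
    mid = n // 2
    # 0-based indexing: middle element for odd n, mean of the two middle ones for even n.
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


def variance(xs, population=False):
    # population=True corresponds to the function names ending with N.
    m = mean(xs)
    divisor = len(xs) if population else len(xs) - 1
    return sum((x - m) ** 2 for x in xs) / divisor


def standard_deviation(xs, population=False):
    return math.sqrt(variance(xs, population))


def standard_error(xs):
    return standard_deviation(xs) / math.sqrt(len(xs))


def covariance(xs, ys):
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson_correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return numerator / denominator


xs = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(xs))                                 # 5.0
print(median(xs))                               # 4.5
print(variance(xs, population=True))            # 4.0
print(standard_deviation(xs, population=True))  # 2.0
print(pearson_correlation(xs, [2 * x for x in xs]) ** 2)  # RSquare, close to 1
```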
189
190 -o, --overwrite
191 Overwrite existing files.
192
193 --outdelim *comma | tab | semicolon*
194 Output text file delimiter. Possible values: *comma, tab, or
195 semicolon*. Default value: *comma*.
196
197 -p, --precision *number*
198 Precision of calculated values in the output file. Default: up to
199 *2* decimal places. Valid values: positive integers.
200
201 -q, --quote *yes | no*
202 Put quotes around column values in output text file. Possible
203 values: *yes or no*. Default value: *yes*.
204
205 -r, --root *rootname*
206 New text file name is generated using the root: <Root><Mode>.<Ext>.
207 Default new file name: <InitialTextFileName><Mode>.<Ext>. Based on
208 the specified analysis, <Mode> corresponds to one of these values:
209 DescriptiveStatisticsBasic, DescriptiveStatisticsAll, AllStatistics,
210 SpecifiedStatistics, Covariance, Correlation, Frequency, or
211 StandardScores. The csv and tsv <Ext> values are used for
212 comma/semicolon and tab delimited text files, respectively. This
213 option is ignored for multiple input files.
214
215 --trimfraction *number*
216 Fraction of data to exclude from the top and bottom of the data set
217 during *TrimMean* calculation. Default value: *0.1*. Valid values: >
218 0 and < 1.
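
A trimmed mean can be sketched in Python as shown below. This document does not say how the number of excluded values is derived from the fraction, so the sketch makes a loud assumption: floor(n * fraction) values are dropped from each end of the sorted data. The script itself may split the fraction across both ends or round differently. The function name trim_mean is hypothetical.

```python
import math


def trim_mean(values, trim_fraction=0.1):
    """Mean of the data after dropping the extremes at both ends."""
    if not 0 < trim_fraction < 1:
        raise ValueError("trim fraction must be > 0 and < 1")
    s = sorted(values)
    # Assumption: floor(n * fraction) values are removed from EACH end.
    drop = math.floor(len(s) * trim_fraction)
    kept = s[drop:len(s) - drop]
    if not kept:
        raise ValueError("trim fraction removes every value")
    return sum(kept) / len(kept)


print(trim_mean([1, 2, 3, 4, 5, 6, 7, 8, 9, 100], 0.1))  # 5.5
```

Here the outlier 100 and the smallest value 1 are dropped, so the result is the mean of 2 through 9.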
219
220 -w, --workingdir *dirname*
221 Location of working directory. Default: current directory.
222
223 EXAMPLES
224 To calculate basic statistics for data in first column and generate a
225 NewSample1DescriptiveStatisticsBasic.csv file, type:
226
227 % AnalyzeTextFilesData.pl -o -r NewSample1 Sample1.csv
228
229 To calculate basic statistics for data in third column and generate a
230 NewSample1DescriptiveStatisticsBasic.csv file, type:
231
232 % AnalyzeTextFilesData.pl --columns 3 -o -r NewSample1 Sample1.csv
233
234 To calculate basic statistics for data in MolWeight column and generate
235 a NewSample1DescriptiveStatisticsBasic.csv file, type:
236
237 % AnalyzeTextFilesData.pl --colmode collabel --columns MolWeight -o
238 -r NewSample1 Sample1.csv
239
240 To calculate all available statistics for data in the third column and all
241 column pairs, and generate NewSample1DescriptiveStatisticsAll.csv,
242 NewSample1CorrelationMatrix.csv, and
243 NewSample1MolWeightFrequencyAnalysis.csv files, type:
244
245 % AnalyzeTextFilesData.pl -m DescriptiveStatisticsAll --columns 3 -o
246 --columnpairs AllPairs -r NewSample1 Sample1.csv
247
248 To compute frequency distribution of data in third column into five bins
249 and generate NewSample1MolWeightFrequencyAnalysis.csv, type:
250
251 % AnalyzeTextFilesData.pl -m Frequency --frequencybins 5 --columns 3
252 -o -r NewSample1 Sample1.csv
253
254 To compute frequency distribution of data in third column into specified
255 bin range values, and generate NewSample1MolWeightFrequencyAnalysis.csv,
256 type:
257
258 % AnalyzeTextFilesData.pl -m Frequency --frequencybins "100,200,400"
259 --columns 3 -o -r NewSample1 Sample1.csv
260
261 To calculate all available statistics for data in all columns and column
262 pairs, type:
263
264 % AnalyzeTextFilesData.pl -m All --columns All --columnpairs
265 AllPairs -o -r NewSample1 Sample1.csv
266
267 AUTHOR
268 Manish Sud <msud@san.rr.com>
269
270 SEE ALSO
271 JoinTextFiles.pl, MergeTextFilesWithSD.pl, ModifyTextFilesFormat.pl,
272 SplitTextFiles.pl, TextFilesToHTML.pl
273
274 COPYRIGHT
275 Copyright (C) 2015 Manish Sud. All rights reserved.
276
277 This file is part of MayaChemTools.
278
279 MayaChemTools is free software; you can redistribute it and/or modify it
280 under the terms of the GNU Lesser General Public License as published by
281 the Free Software Foundation; either version 3 of the License, or (at
282 your option) any later version.
283