comparison docs/scripts/txt/AnalyzeSDFilesData.txt @ 0:4816e4a8ae95 draft default tip

Uploaded
author deepakjadmin
date Wed, 20 Jan 2016 09:23:18 -0500
-1:000000000000 0:4816e4a8ae95
1 NAME
2 AnalyzeSDFilesData.pl - Analyze numerical data field values in SDFile(s)
3
4 SYNOPSIS
5 AnalyzeSDFilesData.pl SDFile(s)...
6
7 AnalyzeSDFilesData.pl [--datafields "fieldlabel,[fieldlabel,...]" |
8 Common | All] [--datafieldpairs "fieldlabel,fieldlabel,[fieldlabel,fieldlabel,...]" |
9 CommonPairs | AllPairs] [-d, --detail infolevel] [-f, --fast] [--frequencybins number
10 | "number,number,[number,...]"] [-h, --help] [--klargest number]
11 [--ksmallest number] [-m, --mode DescriptiveStatisticsBasic |
12 DescriptiveStatisticsAll | All | "function1, [function2,...]"]
13 [--trimfraction number] [-w, --workingdir dirname] SDFile(s)...
14
15 DESCRIPTION
16 Analyze numerical data field values in *SDFile(s)* using a combination
17 of various statistical functions; non-numerical values are ignored.
18 For *Correlation, RSquare, and Covariance* analysis, the count
19 of valid values in the two data fields of a pair must be the same;
20 otherwise, that data field pair is ignored. File names are separated by
21 spaces. The valid file extensions are *.sdf* and *.sd*; all other file
22 names are ignored. All the SD files in the current directory can be
23 specified either by **.sdf* or the current directory name.
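
As a rough illustration of what "numerical data field values" means here, the following Python sketch reads data fields from SD-format text and keeps only the values that parse as numbers. It is not the script's Perl implementation. The function names and the stripped-down sample records are hypothetical, and the parsing assumes the usual SD conventions: a "> <label>" header line, value lines ending at a blank line, and "$$$$" closing each record.

```python
import re

def read_sd_data_fields(sd_text):
    """Return one {field label: raw string value} dict per SD record."""
    records = []
    for block in sd_text.split("$$$$"):
        if not block.strip():
            continue
        fields = {}
        lines = block.splitlines()
        i = 0
        while i < len(lines):
            # A data header line looks like "> <MolWeight>".
            match = re.match(r"^>.*<([^>]+)>", lines[i])
            if match:
                value_lines = []
                i += 1
                # The value runs until the next blank line.
                while i < len(lines) and lines[i].strip():
                    value_lines.append(lines[i].strip())
                    i += 1
                fields[match.group(1)] = " ".join(value_lines)
            i += 1
        records.append(fields)
    return records

def numeric_values(records, label):
    """Collect the values of one field that parse as numbers; skip the rest."""
    values = []
    for fields in records:
        try:
            values.append(float(fields[label]))
        except (KeyError, ValueError):
            continue  # missing or non-numerical values are ignored
    return values

# Hypothetical, stripped-down records (no real atom blocks).
SAMPLE = """\
mol1
M  END
> <MolWeight>
180.16

> <Name>
aspirin

$$$$
mol2
M  END
> <MolWeight>
N/A

$$$$
mol3
M  END
> <MolWeight>
151.16

$$$$
"""

records = read_sd_data_fields(SAMPLE)
print(numeric_values(records, "MolWeight"))  # [180.16, 151.16]
```

The "N/A" value in the second record is dropped, so only two values would reach the statistical functions.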
24
25 OPTIONS
26 --datafields *"fieldlabel,[fieldlabel,...]" | Common | All*
27 Data fields to use for analysis. Possible values: list of comma
28 separated data field labels, data fields common to all records, or
29 all data fields. Default value: *Common*. Examples:
30
31 ALogP,MolWeight,EC50
32 "MolWeight,PSA"
33
34 --datafieldpairs *"fieldlabel,fieldlabel,[fieldlabel,fieldlabel,...]" |
35 CommonPairs | AllPairs*
36 This value is mode specific and is only used for the *Correlation,
37 PearsonCorrelation, or Covariance* values of the -m, --mode option. It
38 specifies data field label pairs to use for data analysis during
39 *Correlation* and *Covariance* calculations. Possible values: comma
40 delimited list of data field label pairs, data field label pairs
41 common to all records, or all data field pairs. Default
42 value: *CommonPairs*. Example:
43
44 MolWeight,EC50,NumN+O,PSA
45
46 For the *AllPairs* value of the --datafieldpairs option, all data
47 field label pairs are used for *Correlation* and *Covariance*
48 calculations.
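
The pairing of a flat label list and the pairwise calculations can be sketched in Python as follows. This is an illustration under assumptions, not the script's code: the function names are hypothetical, consecutive labels are assumed to form a pair, and the correlation and covariance follow the PearsonCorrelation and Covariance formulas given under the -m, --mode option. A pair whose value counts differ is skipped by returning None.

```python
import math

def split_into_pairs(spec):
    """Turn "A,B,C,D" into [("A", "B"), ("C", "D")]."""
    labels = [label.strip() for label in spec.split(",")]
    if len(labels) % 2:
        raise ValueError("expected an even number of field labels")
    return list(zip(labels[0::2], labels[1::2]))

def covariance(xs, ys):
    """SUM( (x[i] - Xmean)(y[i] - Ymean) ) / n, or None for unequal counts."""
    if len(xs) != len(ys):
        return None  # such a pair is ignored
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / n

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient, or None for unequal counts."""
    if len(xs) != len(ys):
        return None  # such a pair is ignored
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - x_mean) ** 2 for x in xs) * sum((y - y_mean) ** 2 for y in ys)
    )
    return numerator / denominator

print(split_into_pairs("MolWeight,EC50,NumN+O,PSA"))
# [('MolWeight', 'EC50'), ('NumN+O', 'PSA')]
print(pearson_correlation([1, 2, 3], [2, 4, 6]))  # 1.0
```

RSquare is then the square of the value returned by pearson_correlation.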
49
50 -d, --detail *infolevel*
51 Level of information to print about data field values being ignored.
52 Default: *0*. Possible values: 0, 1, 2, 3, or 4.
53
54 -f, --fast
55 In this mode, all the data field values specified for analysis are
56 assumed to contain numerical data and no checking is performed
57 before analysis. By default, only numerical data is used for
58 analysis; other types of data are ignored.
59
60 --frequencybins *number | "number,number,[number,...]"*
61 Specify number of bins or bin range to use for frequency analysis.
62 Default value: *10*.
63
64 The number of bins, along with the smallest and largest values of a
65 data field, is used to group the data field values into bins.
66
67 The bin range list is used to group the values of a data field into
68 bins; it must contain values in ascending order.
69 Examples:
70
71 10,20,30
72 0.1,0.2,0.3,0.4,0.5
73
74 The frequency value calculated for a specific bin counts all the
75 data field values that are greater than the previous bin value and
76 less than or equal to the current bin value.
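
The binning rule can be sketched in Python as below. This is only an illustration with hypothetical function names. It makes three assumptions that the text does not state: values at or below the first bin value are counted in the first bin, values above the last bin value are left uncounted, and a plain number of bins is turned into equal-width bins between the smallest and largest value.

```python
from bisect import bisect_left

def frequency_from_bin_range(values, bin_range):
    """Count values v with previous bin value < v <= current bin value."""
    counts = [0] * len(bin_range)
    for value in values:
        # Index of the first bin value that is >= value.
        index = bisect_left(bin_range, value)
        if index < len(bin_range):
            counts[index] += 1
        # Assumption: values above the last bin value are not counted.
    return counts

def bin_range_from_count(values, number_of_bins):
    """Assumption: equal-width bins spanning the smallest to largest value."""
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / number_of_bins
    upper_bounds = [lowest + width * k for k in range(1, number_of_bins)]
    # Use the exact maximum as the last bound to avoid rounding it away.
    return upper_bounds + [highest]

weights = [50, 100, 150, 200, 250, 400, 500]
print(frequency_from_bin_range(weights, [100, 200, 400]))  # [2, 2, 2]
```

In this example 100 lands in the first bin and 200 in the second, because each bin includes its own upper value.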
77
78 -h, --help
79 Print this help message.
80
81 --klargest *number*
82 Kth largest value to find by *KLargest* function. Default value:
83 *2*. Valid values: positive integers.
84
85 --ksmallest *number*
86 Kth smallest value to find by *KSmallest* function. Default value:
87 *2*. Valid values: positive integers.
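
A minimal Python sketch of the Kth largest and Kth smallest lookups follows. The function names are hypothetical, and the sketch assumes that duplicate values each occupy their own rank, which the text above does not specify.

```python
def k_largest(values, k=2):
    """Return the kth largest value (k = 1 is the maximum)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    return sorted(values, reverse=True)[k - 1]

def k_smallest(values, k=2):
    """Return the kth smallest value (k = 1 is the minimum)."""
    if not 1 <= k <= len(values):
        raise ValueError("k must be between 1 and the number of values")
    return sorted(values)[k - 1]

data = [3, 1, 4, 1, 5]
print(k_largest(data), k_smallest(data))  # 4 1
```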
88
89 -m, --mode *DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All
90 | "function1, [function2,...]"*
91 Specify how to analyze data in SDFile(s): calculate basic or all
92 descriptive statistics; or use a comma delimited list of supported
93 statistical functions. Possible values: *DescriptiveStatisticsBasic
94 | DescriptiveStatisticsAll | All | "function1,[function2,...]"*. Default
95 value: *DescriptiveStatisticsBasic*.
96
97 *DescriptiveStatisticsBasic* includes these functions: *Count,
98 Maximum, Minimum, Mean, Median, Sum, StandardDeviation,
99 StandardError, Variance*.
100
101 *DescriptiveStatisticsAll*, in addition to
102 *DescriptiveStatisticsBasic* functions, includes: *GeometricMean,
103 Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Mode,
104 RSquare, Skewness, TrimMean*.
105
106 *All* uses the complete list of supported functions: *Average,
107 AverageDeviation, Correlation, Count, Covariance, GeometricMean,
108 Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Maximum,
109 Minimum, Mean, Median, Mode, RSquare, Skewness, Sum, SumOfSquares,
110 StandardDeviation, StandardDeviationN, StandardError,
111 StandardScores, StandardScoresN, TrimMean, Variance, VarianceN*. The
112 function names ending with N calculate corresponding values assuming
113 an entire population instead of a population sample. Here are the
114 formulas for these functions:
115
116 Average: See Mean
117
118 AverageDeviation: SUM( ABS(x[i] - Xmean) ) / n
119
120 Correlation: See Pearson Correlation
121
122 Covariance: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / n
123
124 GeometricMean: NthROOT( PRODUCT(x[i]) )
125
126 HarmonicMean: 1 / ( SUM(1/x[i]) / n )
127
128 Mean: SUM( x[i] ) / n
129
130 Median: Xsorted[(n - 1)/2 + 1] for odd values of n; (Xsorted[n/2] +
131 Xsorted[n/2 + 1])/2 for even values of n.
132
133 Kurtosis: [ {n(n + 1)/((n - 1)(n - 2)(n - 3))} SUM{ ((x[i] -
134 Xmean)/STDDEV)^4 } ] - {3((n - 1)^2)}/{(n - 2)(n - 3)}
135
136 PearsonCorrelation: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / SQRT( SUM(
137 (x[i] - Xmean)^2 ) SUM( (y[i] - Ymean)^2 ) )
138
139 RSquare: PearsonCorrelation^2
140
141 Skewness: {n/((n - 1)(n - 2))} SUM{ ((x[i] - Xmean)/STDDEV)^3 }
142
143 StandardDeviation: SQRT ( SUM( (x[i] - Mean)^2 ) / (n - 1) )
144
145 StandardDeviationN: SQRT ( SUM( (x[i] - Mean)^2 ) / n )
146
147 StandardError: StandardDeviation / SQRT( n )
148
149 StandardScores: (x[i] - Xmean) / StandardDeviation
150
151 StandardScoresN: (x[i] - Xmean) / StandardDeviationN
152
153 Variance: SUM( (x[i] - Xmean)^2 ) / (n - 1)
154
155 VarianceN: SUM( (x[i] - Xmean)^2 ) / n
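
A few of the formulas above can be written out in Python as a cross-check. This is a sketch with hypothetical function names, not the script's Perl code. The median uses the usual definition (the middle value for odd n, the mean of the two middle values for even n), translated from the 1-based Xsorted indices to Python's 0-based lists. The population flag selects the variants whose names end with N.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    ordered = sorted(xs)
    n = len(ordered)
    middle = n // 2
    if n % 2:
        return ordered[middle]  # odd n: the single middle value
    return (ordered[middle - 1] + ordered[middle]) / 2  # even n

def average_deviation(xs):
    x_mean = mean(xs)
    return sum(abs(x - x_mean) for x in xs) / len(xs)

def variance(xs, population=False):
    x_mean = mean(xs)
    sum_of_squares = sum((x - x_mean) ** 2 for x in xs)
    divisor = len(xs) if population else len(xs) - 1
    return sum_of_squares / divisor

def standard_deviation(xs, population=False):
    return math.sqrt(variance(xs, population))

def standard_error(xs):
    return standard_deviation(xs) / math.sqrt(len(xs))

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data), median(data))                   # 5.0 4.5
print(standard_deviation(data, population=True))  # 2.0
```

For this data set the sample variance is 32/7, so the sample standard deviation is slightly larger than the population value of 2.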
156
157 -o, --overwrite
158 Overwrite existing files.
159
160 --outdelim *comma | tab | semicolon*
161 Output text file delimiter. Possible values: *comma, tab, or
162 semicolon*. Default value: *comma*.
163
164 -p, --precision *number*
165 Precision of calculated values in the output file. Default: up to
166 *2* decimal places. Valid values: positive integers.
167
168 -q, --quote *yes | no*
169 Put quotes around column values in output text file. Possible
170 values: *yes or no*. Default value: *yes*.
171
172 -r, --root *rootname*
173 New text file name is generated using the root: <Root>.<Ext>.
174 Default new file name: <InitialSDFileName><Mode>.<Ext>. Based on the
175 specified analysis, <Mode> corresponds to one of these values:
176 DescriptiveStatisticsBasic, DescriptiveStatisticsAll, AllStatistics,
177 SpecifiedStatistics, Covariance, Correlation, Frequency, or
178 StandardScores. The csv and tsv <Ext> values are used for
179 comma/semicolon and tab delimited text files, respectively. This
180 option is ignored for multiple input files.
181
182 --trimfraction *number*
183 Fraction of data to exclude from the top and bottom of the data set
184 during *TrimMean* calculation. Default value: *0.1*. Valid values: >
185 0 and < 1.
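
The trimmed mean can be sketched in Python as below. The text does not say how the fraction is divided between the two ends, so this sketch assumes the spreadsheet-style convention: the fraction is the total share of values to exclude, half from the top and half from the bottom, rounded down. The function name is hypothetical, and the script itself may round or split differently.

```python
def trim_mean(values, trim_fraction=0.1):
    """Mean after dropping trim_fraction of the values, split over both ends."""
    if not 0 < trim_fraction < 1:
        raise ValueError("trim fraction must be > 0 and < 1")
    ordered = sorted(values)
    # Assumption: half of the excluded values come from each end, rounded down.
    drop_each_end = int(len(ordered) * trim_fraction / 2)
    kept = ordered[drop_each_end:len(ordered) - drop_each_end]
    return sum(kept) / len(kept)

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trim_mean(data, 0.2))  # 5.5
```

With ten values and a fraction of 0.2, one value is dropped from each end, which removes the outlier 100 before averaging.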
186
187 -w, --workingdir *dirname*
188 Location of working directory. Default: current directory.
189
190 EXAMPLES
191 To calculate basic statistics for data in all common data fields and
192 generate a NewSample1DescriptiveStatisticsBasic.csv file, type:
193
194 % AnalyzeSDFilesData.pl -o -r NewSample1 Sample1.sdf
195
196 To calculate basic statistics for MolWeight data field and generate a
197 NewSample1DescriptiveStatisticsBasic.csv file, type:
198
199 % AnalyzeSDFilesData.pl --datafields MolWeight -o -r NewSample1
200 Sample1.sdf
201
202 To calculate all available statistics for MolWeight data field and all
203 data field pairs, and generate NewSample1DescriptiveStatisticsAll.csv,
204 NewSample1CorrelationMatrix.csv, and
205 NewSample1MolWeightFrequencyAnalysis.csv files, type:
206
207 % AnalyzeSDFilesData.pl -m DescriptiveStatisticsAll --datafields
208 MolWeight -o --datafieldpairs AllPairs -r NewSample1 Sample1.sdf
209
210 To compute frequency distribution of MolWeight data field into five bins
211 and generate NewSample1MolWeightFrequencyAnalysis.csv, type:
212
213 % AnalyzeSDFilesData.pl -m Frequency --frequencybins 5 --datafields
214 MolWeight -o -r NewSample1 Sample1.sdf
215
216 To compute frequency distribution of data in MolWeight data field into
217 specified bin range values, and generate
218 NewSample1MolWeightFrequencyAnalysis.csv, type:
219
220 % AnalyzeSDFilesData.pl -m Frequency --frequencybins "100,200,400"
221 --datafields MolWeight -o -r NewSample1 Sample1.sdf
222
223 To calculate all available statistics for data in all data fields and
224 pairs, type:
225
226 % AnalyzeSDFilesData.pl -m All --datafields All --datafieldpairs
227 AllPairs -o -r NewSample1 Sample1.sdf
228
229 AUTHOR
230 Manish Sud <msud@san.rr.com>
231
232 SEE ALSO
233 FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl,
234 MergeTextFilesWithSD.pl
235
236 COPYRIGHT
237 Copyright (C) 2015 Manish Sud. All rights reserved.
238
239 This file is part of MayaChemTools.
240
241 MayaChemTools is free software; you can redistribute it and/or modify it
242 under the terms of the GNU Lesser General Public License as published by
243 the Free Software Foundation; either version 3 of the License, or (at
244 your option) any later version.
245