comparison mayachemtools/docs/scripts/html/AnalyzeSDFilesData.html @ 0:73ae111cf86f draft

Uploaded
author deepakjadmin
date Wed, 20 Jan 2016 11:55:01 -0500
-1:000000000000 0:73ae111cf86f
1 <html>
2 <head>
3 <title>MayaChemTools:Documentation:AnalyzeSDFilesData.pl</title>
4 <meta http-equiv="content-type" content="text/html;charset=utf-8">
5 <link rel="stylesheet" type="text/css" href="../../css/MayaChemTools.css">
6 </head>
7 <body leftmargin="20" rightmargin="20" topmargin="10" bottommargin="10">
8 <br/>
9 <center>
10 <a href="http://www.mayachemtools.org" title="MayaChemTools Home"><img src="../../images/MayaChemToolsLogo.gif" border="0" alt="MayaChemTools"></a>
11 </center>
12 <br/>
13 <div class="DocNav">
14 <table width="100%" border=0 cellpadding=0 cellspacing=2>
15 <tr align="left" valign="top"><td width="33%" align="left"><a href="./README.html" title="README.html">Previous</a>&nbsp;&nbsp;<a href="./index.html" title="Table of Contents">TOC</a>&nbsp;&nbsp;<a href="./AnalyzeSequenceFilesData.html" title="AnalyzeSequenceFilesData.html">Next</a></td><td width="34%" align="middle"><strong>AnalyzeSDFilesData.pl</strong></td><td width="33%" align="right"><a href="././code/AnalyzeSDFilesData.html" title="View source code">Code</a>&nbsp;|&nbsp;<a href="./../pdf/AnalyzeSDFilesData.pdf" title="PDF US Letter Size">PDF</a>&nbsp;|&nbsp;<a href="./../pdfgreen/AnalyzeSDFilesData.pdf" title="PDF US Letter Size with narrow margins: www.changethemargins.com">PDFGreen</a>&nbsp;|&nbsp;<a href="./../pdfa4/AnalyzeSDFilesData.pdf" title="PDF A4 Size">PDFA4</a>&nbsp;|&nbsp;<a href="./../pdfa4green/AnalyzeSDFilesData.pdf" title="PDF A4 Size with narrow margins: www.changethemargins.com">PDFA4Green</a></td></tr>
16 </table>
17 </div>
18 <p>
19 </p>
20 <h2>NAME</h2>
21 <p>AnalyzeSDFilesData.pl - Analyze numerical data field values in SDFile(s)</p>
22 <p>
23 </p>
24 <h2>SYNOPSIS</h2>
25 <p>AnalyzeSDFilesData.pl SDFile(s)...</p>
26 <p>AnalyzeSDFilesData.pl [<strong>--datafields</strong> &quot;fieldlabel,[fieldlabel,...]&quot; | Common | All]
27 [<strong>--datafieldpairs</strong> &quot;fieldlabel,fieldlabel,[fieldlabel,fieldlabel,...]&quot; | CommonPairs | AllPairs] [<strong>-d, --detail</strong> infolevel]
28 [<strong>-f, --fast</strong>] [<strong>--frequencybins</strong> number | &quot;number,number,[number,...]&quot;]
29 [<strong>-h, --help</strong>] [<strong>--klargest</strong> number] [<strong>--ksmallest</strong> number]
30 [<strong>-m, --mode</strong> DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | &quot;function1, [function2,...]&quot;]
31 [<strong>--trimfraction</strong> number] [<strong>-w, --workingdir</strong> dirname] SDFile(s)...</p>
32 <p>
33 </p>
34 <h2>DESCRIPTION</h2>
35 <p>Analyze numerical data field values in <em>SDFile(s)</em> using a combination of various statistical
36 functions; non-numerical values are simply ignored. For <em>Correlation, RSquare, and
37 Covariance</em> analysis, the count of valid values in the specified data field pairs must be the same;
38 otherwise, the data field pair is ignored. The file names are separated by spaces. The valid file
39 extensions are <em>.sdf</em> and <em>.sd</em>. All other file names are ignored. All the SD files in the
40 current directory can be specified either by <em>*.sdf</em> or the current directory name.</p>
41 <p>
42 </p>
43 <h2>OPTIONS</h2>
44 <dl>
45 <dt><strong><strong>--datafields</strong> <em>&quot;fieldlabel,[fieldlabel,...]&quot; | Common | All</em></strong></dt>
46 <dd>
47 <p>Data fields to use for analysis. Possible values: list of comma separated data field
48 labels, data fields common to all records, or all data fields. Default value: <em>Common</em>.
49 Examples:</p>
50 <div class="OptionsBox">
51 ALogP,MolWeight,EC50
52 <br/> &quot;MolWeight,PSA&quot;</div>
53 </dd>
54 <dt><strong><strong>--datafieldpairs</strong> <em>&quot;fieldlabel,fieldlabel,[fieldlabel,fieldlabel,...]&quot; | CommonPairs | AllPairs</em></strong></dt>
55 <dd>
56 <p>This value is mode specific and is only used for the <em>Correlation, PearsonCorrelation, or
57 Covariance</em> values of the <strong>-m, --mode</strong> option. It specifies the data field label pairs to use
58 for data analysis during <em>Correlation</em> and <em>Covariance</em> calculations. Possible values:
59 comma delimited list of data field label pairs, data field label pairs common to all records,
60 or all data field pairs. Default value: <em>CommonPairs</em>. Example:</p>
61 <div class="OptionsBox">
62 MolWeight,EC50,NumN+O,PSA</div>
63 <p>For <em>AllPairs</em> value of <strong>--datafieldpairs</strong> option, all data field label pairs are used for
64 <em>Correlation</em> and <em>Covariance</em> calculations.</p>
65 </dd>
66 <dt><strong><strong>-d, --detail</strong> <em>infolevel</em></strong></dt>
67 <dd>
68 <p>Level of information to print about column values being ignored. Default: <em>0</em>. Possible values:
69 0, 1, 2, 3, or 4.</p>
70 </dd>
71 <dt><strong><strong>-f, --fast</strong></strong></dt>
72 <dd>
73 <p>In this mode, all the data field values specified for analysis are assumed to contain numerical
74 data and no checking is performed before analysis. By default, only numerical data is
75 used for analysis; other types of column data are ignored.</p>
76 </dd>
77 <dt><strong><strong>--frequencybins</strong> <em>number | &quot;number,number,[number,...]&quot;</em></strong></dt>
78 <dd>
79 <p>Specify the number of bins or the bin range to use for frequency analysis. Default value: <em>10</em>.</p>
80 <p>The number of bins, along with the smallest and largest values for a column, is used to
81 group the column values into different groups.</p>
82 <p>The bin range list is used to group values for a column into different groups; it must contain
83 values in ascending order. Examples:</p>
84 <div class="OptionsBox">
85 10,20,30
86 <br/> 0.1,0.2,0.3,0.4,0.5</div>
87 <p>The frequency value calculated for a specific bin corresponds to all the column values
88 which are greater than the previous bin value and less than or equal to the current bin value.</p>
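<p>The binning rule above can be illustrated with a minimal Python sketch. It is not the script's own
implementation: the function names are hypothetical, equal-width bins are assumed for the number-of-bins
form, and values above the last bin value are assumed to go uncounted.</p>

```python
from bisect import bisect_left


def frequency_by_bin_range(values, bin_edges):
    """Count values per bin, where a bin holds previous_edge < value <= edge.

    bin_edges must be in ascending order, as the option requires.
    """
    counts = [0] * len(bin_edges)
    for value in values:
        # bisect_left gives the index of the first edge that is >= value.
        index = bisect_left(bin_edges, value)
        # Assumption: values greater than the last edge are not counted.
        if index < len(bin_edges):
            counts[index] += 1
    return dict(zip(bin_edges, counts))


def bin_edges_from_count(values, number_of_bins):
    """Derive upper bin edges from the smallest and largest values.

    Assumption: the range is split into equal-width bins.
    """
    low, high = min(values), max(values)
    width = (high - low) / number_of_bins
    edges = [low + width * k for k in range(1, number_of_bins + 1)]
    edges[-1] = high  # guard against floating-point drift on the last edge
    return edges


if __name__ == "__main__":
    weights = [50, 100, 150, 200, 250, 400, 500]
    print(frequency_by_bin_range(weights, [100, 200, 400]))
    print(bin_edges_from_count(weights, 5))
```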
89 </dd>
90 <dt><strong><strong>-h, --help</strong></strong></dt>
91 <dd>
92 <p>Print this help message.</p>
93 </dd>
94 <dt><strong><strong>--klargest</strong> <em>number</em></strong></dt>
95 <dd>
96 <p>Kth largest value to find by <em>KLargest</em> function. Default value: <em>2</em>. Valid values: positive
97 integers.</p>
98 </dd>
99 <dt><strong><strong>--ksmallest</strong> <em>number</em></strong></dt>
100 <dd>
101 <p>Kth smallest value to find by <em>KSmallest</em> function. Default value: <em>2</em>. Valid values: positive
102 integers.</p>
103 </dd>
104 <dt><strong><strong>-m, --mode</strong> <em>DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | &quot;function1, [function2,...]&quot;</em></strong></dt>
105 <dd>
106 <p>Specify how to analyze data in SDFile(s): calculate basic or all descriptive statistics; or
107 use a comma delimited list of supported statistical functions. Possible values:
108 <em>DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | &quot;function1,[function2]...&quot;</em>. Default
109 value: <em>DescriptiveStatisticsBasic</em>.</p>
110 <p><em>DescriptiveStatisticsBasic</em> includes these functions: <em>Count, Maximum, Minimum, Mean,
111 Median, Sum, StandardDeviation, StandardError, Variance</em>.</p>
112 <p><em>DescriptiveStatisticsAll</em>, in addition to <em>DescriptiveStatisticsBasic</em> functions, includes:
113 <em>GeometricMean, Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Mode, RSquare,
114 Skewness, TrimMean</em>.</p>
115 <p><em>All</em> uses complete list of supported functions: <em>Average, AverageDeviation, Correlation,
116 Count, Covariance, GeometricMean, Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis,
117 Maximum, Minimum, Mean, Median, Mode, RSquare, Skewness, Sum,
118 SumOfSquares, StandardDeviation, StandardDeviationN, StandardError, StandardScores,
119 StandardScoresN, TrimMean, Variance, VarianceN</em>. The function names ending with N
120 calculate corresponding values assuming an entire population instead of a population sample.
121 Here are the formulas for these functions:</p>
122 <p>Average: See Mean</p>
123 <p>AverageDeviation: SUM( ABS(x[i] - Xmean) ) / n</p>
124 <p>Correlation: See Pearson Correlation</p>
125 <p>Covariance: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / n</p>
126 <p>GeometricMean: NthROOT( PRODUCT(x[i]) )</p>
127 <p>HarmonicMean: 1 / ( SUM(1/x[i]) / n )</p>
128 <p>Mean: SUM( x[i] ) / n</p>
129 <p>Median: Xsorted[(n - 1)/2 + 1] for odd values of n; (Xsorted[n/2] + Xsorted[n/2 + 1])/2
130 for even values of n.</p>
131 <p>Kurtosis: [ {n(n + 1)/((n - 1)(n - 2)(n - 3))} SUM{ ((x[i] - Xmean)/STDDEV)^4 } ] -
132 {3((n - 1)^2)}/{(n - 2)(n - 3)}</p>
133 <p>PearsonCorrelation: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / SQRT( SUM( (x[i] - Xmean)^2 )
134 (SUM( (y[i] - Ymean)^2 )) )</p>
135 <p>RSquare: PearsonCorrelation^2</p>
136 <p>Skewness: {n/((n - 1)(n - 2))} SUM{ ((x[i] - Xmean)/STDDEV)^3 }</p>
137 <p>StandardDeviation: SQRT ( SUM( (x[i] - Mean)^2 ) / (n - 1) )</p>
138 <p>StandardDeviationN: SQRT ( SUM( (x[i] - Mean)^2 ) / n )</p>
139 <p>StandardError: StandardDeviation / SQRT( n )</p>
140 <p>StandardScores: (x[i] - Mean) / StandardDeviation</p>
141 <p>StandardScoresN: (x[i] - Mean) / StandardDeviationN</p>
142 <p>Variance: SUM( (x[i] - Xmean)^2 ) / (n - 1)</p>
143 <p>VarianceN: SUM( (x[i] - Xmean)^2 ) / n</p>
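<p>As a cross-check on a few of the formulas listed above, here is a minimal Python sketch. It is not
taken from the script's source: the function names and the <em>population</em> flag are hypothetical, and
lists are indexed from 0 rather than from 1 as in the formulas.</p>

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def median(xs):
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    # 0-based indexing: the middle element for odd n,
    # the average of the two middle elements for even n.
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


def variance(xs, population=False):
    # population=True corresponds to the functions whose names end with N.
    m = mean(xs)
    divisor = len(xs) if population else len(xs) - 1
    return sum((x - m) ** 2 for x in xs) / divisor


def standard_deviation(xs, population=False):
    return math.sqrt(variance(xs, population))


def standard_error(xs):
    return standard_deviation(xs) / math.sqrt(len(xs))


def pearson_correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )
    return numerator / denominator


def skewness(xs):
    n, m, sd = len(xs), mean(xs), standard_deviation(xs)
    return n / ((n - 1) * (n - 2)) * sum(((x - m) / sd) ** 3 for x in xs)


def kurtosis(xs):
    n, m, sd = len(xs), mean(xs), standard_deviation(xs)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(((x - m) / sd) ** 4 for x in xs) - correction


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(mean(data), median(data), variance(data), standard_error(data))
    print(pearson_correlation([1, 2, 3], [2, 4, 6]) ** 2)  # RSquare
```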
144 </dd>
145 <dt><strong><strong>-o, --overwrite</strong></strong></dt>
146 <dd>
147 <p>Overwrite existing files.</p>
148 </dd>
149 <dt><strong><strong>--outdelim</strong> <em>comma | tab | semicolon</em></strong></dt>
150 <dd>
151 <p>Output text file delimiter. Possible values: <em>comma, tab, or semicolon</em>.
152 Default value: <em>comma</em>.</p>
153 </dd>
154 <dt><strong><strong>-p, --precision</strong> <em>number</em></strong></dt>
155 <dd>
156 <p>Precision of calculated values in the output file. Default: up to <em>2</em> decimal places.
157 Valid values: positive integers.</p>
158 </dd>
159 <dt><strong><strong>-q, --quote</strong> <em>yes | no</em></strong></dt>
160 <dd>
161 <p>Put quotes around column values in output text file. Possible values: <em>yes or
162 no</em>. Default value: <em>yes</em>.</p>
163 </dd>
164 <dt><strong><strong>-r, --root</strong> <em>rootname</em></strong></dt>
165 <dd>
166 <p>New text file name is generated using the root: &lt;Root&gt;.&lt;Ext&gt;. Default new file
167 name: &lt;InitialSDFileName&gt;&lt;Mode&gt;.&lt;Ext&gt;. Based on the specified analysis,
168 &lt;Mode&gt; corresponds to one of these values: DescriptiveStatisticsBasic,
169 DescriptiveStatisticsAll, AllStatistics, SpecifiedStatistics, Covariance, Correlation,
170 Frequency, or StandardScores. The csv, and tsv &lt;Ext&gt; values are used for
171 comma/semicolon, and tab delimited text files respectively. This option is ignored for
172 multiple input files.</p>
173 </dd>
174 <dt><strong><strong>--trimfraction</strong> <em>number</em></strong></dt>
175 <dd>
176 <p>Fraction of data to exclude from the top and bottom of the data set during
177 <em>TrimMean</em> calculation. Default value: <em>0.1</em>. Valid values: &gt; 0 and &lt; 1.</p>
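<p>A trimmed mean of this kind might be sketched in Python as follows. This is an illustration only:
the function name is hypothetical, and the sketch assumes that the fraction is the total share of
values removed, split evenly between the top and the bottom and rounded down. The script itself may
interpret the fraction differently.</p>

```python
import math


def trim_mean(values, trim_fraction=0.1):
    """Mean of the values left after trimming both ends of the sorted data."""
    if not 0 < trim_fraction < 1:
        raise ValueError("trim_fraction must be > 0 and < 1")
    ordered = sorted(values)
    # Assumption: trim_fraction is the total share to drop, half from each end.
    drop_each_end = math.floor(len(ordered) * trim_fraction / 2)
    kept = ordered[drop_each_end:len(ordered) - drop_each_end]
    return sum(kept) / len(kept)


if __name__ == "__main__":
    # 20 values with one large outlier; one value is dropped from each end.
    sample = list(range(1, 20)) + [1000]
    print(trim_mean(sample, 0.1))
```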
178 </dd>
179 <dt><strong><strong>-w, --workingdir</strong> <em>dirname</em></strong></dt>
180 <dd>
181 <p>Location of working directory. Default: current directory.</p>
182 </dd>
183 </dl>
184 <p>
185 </p>
186 <h2>EXAMPLES</h2>
187 <p>To calculate basic statistics for data in all common data fields and generate a
188 NewSample1DescriptiveStatisticsBasic.csv file, type:</p>
189 <div class="ExampleBox">
190 % AnalyzeSDFilesData.pl -o -r NewSample1 Sample1.sdf</div>
191 <p>To calculate basic statistics for MolWeight data field and generate a
192 NewSample1DescriptiveStatisticsBasic.csv file, type:</p>
193 <div class="ExampleBox">
194 % AnalyzeSDFilesData.pl --datafields MolWeight -o -r NewSample1
195 Sample1.sdf</div>
196 <p>To calculate all available statistics for MolWeight data field and all data field pairs,
197 and generate NewSample1DescriptiveStatisticsAll.csv, NewSample1CorrelationMatrix.csv,
198 and NewSample1MolWeightFrequencyAnalysis.csv
199 files, type:</p>
200 <div class="ExampleBox">
201 % AnalyzeSDFilesData.pl -m DescriptiveStatisticsAll --datafields
202 MolWeight -o --datafieldpairs AllPairs -r NewSample1 Sample1.sdf</div>
203 <p>To compute frequency distribution of MolWeight data field into five bins and
204 generate NewSample1MolWeightFrequencyAnalysis.csv, type:</p>
205 <div class="ExampleBox">
206 % AnalyzeSDFilesData.pl -m Frequency --frequencybins 5 --datafields
207 MolWeight -o -r NewSample1 Sample1.sdf</div>
208 <p>To compute frequency distribution of data in MolWeight data field into specified bin range
209 values, and generate NewSample1MolWeightFrequencyAnalysis.csv, type:</p>
210 <div class="ExampleBox">
211 % AnalyzeSDFilesData.pl -m Frequency --frequencybins &quot;100,200,400&quot;
212 --datafields MolWeight -o -r NewSample1 Sample1.sdf</div>
213 <p>To calculate all available statistics for data in all data fields and pairs, type:</p>
214 <div class="ExampleBox">
215 % AnalyzeSDFilesData.pl -m All --datafields All --datafieldpairs
216 AllPairs -o -r NewSample1 Sample1.sdf</div>
217 <p>
218 </p>
219 <h2>AUTHOR</h2>
220 <p><a href="mailto:msud@san.rr.com">Manish Sud</a></p>
221 <p>
222 </p>
223 <h2>SEE ALSO</h2>
224 <p><a href="./FilterSDFiles.html">FilterSDFiles.pl</a>,&nbsp;<a href="./InfoSDFiles.html">InfoSDFiles.pl</a>,&nbsp;<a href="./SplitSDFiles.html">SplitSDFiles.pl</a>,&nbsp;<a href="./MergeTextFilesWithSD.html">MergeTextFilesWithSD.pl</a>
225 </p>
226 <p>
227 </p>
228 <h2>COPYRIGHT</h2>
229 <p>Copyright (C) 2015 Manish Sud. All rights reserved.</p>
230 <p>This file is part of MayaChemTools.</p>
231 <p>MayaChemTools is free software; you can redistribute it and/or modify it under
232 the terms of the GNU Lesser General Public License as published by the Free
233 Software Foundation; either version 3 of the License, or (at your option)
234 any later version.</p>
235 <p>&nbsp;</p><p>&nbsp;</p><div class="DocNav">
236 <table width="100%" border=0 cellpadding=0 cellspacing=2>
237 <tr align="left" valign="top"><td width="33%" align="left"><a href="./README.html" title="README.html">Previous</a>&nbsp;&nbsp;<a href="./index.html" title="Table of Contents">TOC</a>&nbsp;&nbsp;<a href="./AnalyzeSequenceFilesData.html" title="AnalyzeSequenceFilesData.html">Next</a></td><td width="34%" align="middle"><strong>March 29, 2015</strong></td><td width="33%" align="right"><strong>AnalyzeSDFilesData.pl</strong></td></tr>
238 </table>
239 </div>
240 <br />
241 <center>
242 <img src="../../images/h2o2.png">
243 </center>
244 </body>
245 </html>