comparison docs/scripts/html/AnalyzeTextFilesData.html @ 0:4816e4a8ae95 draft default tip

Uploaded
author deepakjadmin
date Wed, 20 Jan 2016 09:23:18 -0500
-1:000000000000 0:4816e4a8ae95
1 <html>
2 <head>
3 <title>MayaChemTools:Documentation:AnalyzeTextFilesData.pl</title>
4 <meta http-equiv="content-type" content="text/html;charset=utf-8">
5 <link rel="stylesheet" type="text/css" href="../../css/MayaChemTools.css">
6 </head>
7 <body leftmargin="20" rightmargin="20" topmargin="10" bottommargin="10">
8 <br/>
9 <center>
10 <a href="http://www.mayachemtools.org" title="MayaChemTools Home"><img src="../../images/MayaChemToolsLogo.gif" border="0" alt="MayaChemTools"></a>
11 </center>
12 <br/>
13 <div class="DocNav">
14 <table width="100%" border=0 cellpadding=0 cellspacing=2>
15 <tr align="left" valign="top"><td width="33%" align="left"><a href="./AnalyzeSequenceFilesData.html" title="AnalyzeSequenceFilesData.html">Previous</a>&nbsp;&nbsp;<a href="./index.html" title="Table of Contents">TOC</a>&nbsp;&nbsp;<a href="./AtomNeighborhoodsFingerprints.html" title="AtomNeighborhoodsFingerprints.html">Next</a></td><td width="34%" align="middle"><strong>AnalyzeTextFilesData.pl</strong></td><td width="33%" align="right"><a href="././code/AnalyzeTextFilesData.html" title="View source code">Code</a>&nbsp;|&nbsp;<a href="./../pdf/AnalyzeTextFilesData.pdf" title="PDF US Letter Size">PDF</a>&nbsp;|&nbsp;<a href="./../pdfgreen/AnalyzeTextFilesData.pdf" title="PDF US Letter Size with narrow margins: www.changethemargins.com">PDFGreen</a>&nbsp;|&nbsp;<a href="./../pdfa4/AnalyzeTextFilesData.pdf" title="PDF A4 Size">PDFA4</a>&nbsp;|&nbsp;<a href="./../pdfa4green/AnalyzeTextFilesData.pdf" title="PDF A4 Size with narrow margins: www.changethemargins.com">PDFA4Green</a></td></tr>
16 </table>
17 </div>
18 <p>
19 </p>
20 <h2>NAME</h2>
21 <p>AnalyzeTextFilesData.pl - Analyze numerical column data in TextFile(s)</p>
22 <p>
23 </p>
24 <h2>SYNOPSIS</h2>
25 <p>AnalyzeTextFilesData.pl TextFile(s)...</p>
26 <p>AnalyzeTextFilesData.pl [<strong>-c, --colmode</strong> colnum | collabel] [<strong>--columns</strong> &quot;colnum,[colnum,...]&quot; | &quot;collabel,[collabel,...]&quot; | All]
27 [<strong>--columnpairs</strong> &quot;colnum,colnum,[colnum,colnum]...&quot; | &quot;collabel,collabel,[collabel,collabel]...&quot; | AllPairs]
28 [<strong>-d, --detail</strong> infolevel] [<strong>-f, --fast</strong>] [<strong>--frequencybins</strong> number | &quot;number,number,[number,...]&quot;] [<strong>-h, --help</strong>]
29 [<strong>--indelim</strong> comma | semicolon] [<strong>--klargest</strong> number] [<strong>--ksmallest</strong> number]
30 [<strong>-m, --mode</strong> DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | &quot;function1, [function2,...]&quot;]
31 [<strong>-o, --overwrite</strong>] [<strong>--outdelim</strong> comma | tab | semicolon] [<strong>-p, --precision</strong> number]
32 [<strong>-q, --quote</strong> yes | no] [<strong>-r, --root</strong> rootname] [<strong>--trimfraction</strong> number] [<strong>-w, --workingdir</strong> dirname] TextFile(s)...</p>
33 <p>
34 </p>
35 <h2>DESCRIPTION</h2>
36 <p>Analyze numerical column data in <em>TextFile(s)</em> using a combination of various statistical
37 functions; non-numerical values are simply ignored. For <em>Correlation, RSquare, and Covariance</em>
38 analysis, the count of valid values in each specified column pair must be the same; otherwise, the column
39 pair is ignored. The file names are separated by spaces. The valid file extensions are <em>.csv</em>
40 and <em>.tsv</em> for comma/semicolon and tab delimited text files respectively. All other
41 file names are ignored. All the text files in the current directory can be specified by
42 <em>*.csv</em>, <em>*.tsv</em>, or the current directory name. The <strong>--indelim</strong> option determines
43 the format of <em>TextFile(s)</em>. Any file which doesn't correspond to the format indicated
44 by the <strong>--indelim</strong> option is ignored.</p>
45 <p>
46 </p>
47 <h2>OPTIONS</h2>
48 <dl>
49 <dt><strong><strong>-c, --colmode</strong> <em>colnum | collabel</em></strong></dt>
50 <dd>
51 <p>Specify how columns are identified in TextFile(s): using column number or column
52 label. Possible values: <em>colnum or collabel</em>. Default value: <em>colnum</em>.</p>
53 </dd>
54 <dt><strong><strong>--columns</strong> <em>&quot;colnum,[colnum,...]&quot; | &quot;collabel,[collabel]...&quot; | All</em></strong></dt>
55 <dd>
56 <p>This value is mode specific. It's a list of comma delimited columns to use
57 for data analysis. Default value: <em>First column</em>.</p>
58 <p>This value is ignored during <em>Correlation/Pearson Correlation</em> and <em>Covariance</em>
59 data analysis; the <strong>--columnpairs</strong> option is used instead.</p>
60 <p>For <em>colnum</em> value of <strong>-c, --colmode</strong> option, input values format is:
61 <em>colnum,colnum,...</em>. Example:</p>
62 <div class="OptionsBox">
63 1,3,5</div>
64 <p>For <em>collabel</em> value of <strong>-c, --colmode</strong> option, input values format is:
65 <em>collabel,collabel,...</em>. Example:</p>
66 <div class="OptionsBox">
67 ALogP,MolWeight,EC50</div>
68 </dd>
69 <dt><strong><strong>--columnpairs</strong> <em>&quot;colnum,colnum,[colnum,colnum,...]&quot; | &quot;collabel,collabel,[collabel,collabel,...]&quot; | AllPairs</em></strong></dt>
70 <dd>
71 <p>This value is mode specific and is only used for <em>Correlation, PearsonCorrelation, or
72 Covariance</em> value of <strong>-m, --mode</strong> option. It is a comma delimited list of column pairs
73 to use for data analysis during <em>Correlation</em> and <em>Covariance</em> calculations. Default value:
74 <em>First column, Second column</em>.</p>
75 <p>For <em>colnum</em> value of <strong>-c, --colmode</strong> option, input values format is:
76 <em>colnum,colnum,[colnum,colnum]...</em>. Example:</p>
77 <div class="OptionsBox">
78 1,3,5,6,1,6</div>
79 <p>For <em>collabel</em> value of <strong>-c, --colmode</strong> option, input values format is:
80 <em>collabel,collabel,[collabel,collabel]...</em>. Example:</p>
81 <div class="OptionsBox">
82 MolWeight,EC50,NumN+O,PSA</div>
83 <p>For <em>AllPairs</em> value of <strong>--columnpairs</strong> option, all column pairs are used for <em>Correlation</em>
84 and <em>Covariance</em> calculations.</p>
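<p>The flat pair list described above can be read two entries at a time. The following Python sketch only illustrates that reading; the helper name <em>parse_column_pairs</em> is hypothetical and is not part of AnalyzeTextFilesData.pl, and the handling of an odd number of entries is an assumption.</p>

```python
def parse_column_pairs(spec):
    """Split a flat 'a,b,c,d' specification into [(a, b), (c, d)] pairs."""
    items = [item.strip() for item in spec.split(",")]
    if len(items) % 2 != 0:
        # Assumption: an unpaired trailing entry is treated as an error.
        raise ValueError("column pairs need an even number of entries")
    # Walk the list two entries at a time.
    return [(items[i], items[i + 1]) for i in range(0, len(items), 2)]


print(parse_column_pairs("1,3,5,6,1,6"))
print(parse_column_pairs("MolWeight,EC50,NumN+O,PSA"))
```

<p>With this reading, <em>1,3,5,6,1,6</em> names the column pairs (1,3), (5,6), and (1,6).</p>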
85 </dd>
86 <dt><strong><strong>-d, --detail</strong> <em>infolevel</em></strong></dt>
87 <dd>
88 <p>Level of information to print about column values being ignored. Default: <em>1</em>. Possible values:
89 1, 2, 3, or 4.</p>
90 </dd>
91 <dt><strong><strong>-f, --fast</strong></strong></dt>
92 <dd>
93 <p>In this mode, all the columns specified for analysis are assumed to contain numerical
94 data and no checking is performed before analysis. By default, only numerical data is
95 used for analysis; other types of column data are ignored.</p>
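<p>As a rough illustration of the default behavior, the Python sketch below collects only the numerical cells of one column. It is not the script's code: the function name <em>numeric_column</em> is hypothetical, the first row is assumed to hold column labels, and "numerical" is assumed to mean a finite value that Python's <em>float()</em> can parse.</p>

```python
import csv
import io
import math


def numeric_column(text, column, delimiter=","):
    """Return the numeric cells of a 1-based column, skipping the label row."""
    rows = csv.reader(io.StringIO(text), delimiter=delimiter)
    next(rows, None)  # assumption: the first row holds the column labels
    values = []
    for row in rows:
        try:
            value = float(row[column - 1])
        except (ValueError, IndexError):
            continue  # non-numerical or missing cell: ignore it
        if math.isfinite(value):
            values.append(value)
    return values


sample = "ID,MolWeight\nA,100.5\nB,n/a\nC,200\n"
print(numeric_column(sample, 2))
```

<p>A fast mode would skip the check and convert every cell directly, which is quicker but fails on a cell such as <em>n/a</em>.</p>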
96 </dd>
97 <dt><strong><strong>--frequencybins</strong> <em>number | &quot;number,number,[number,...]&quot;</em></strong></dt>
98 <dd>
99 <p>Specify number of bins or bin range to use for frequency analysis. Default value: <em>10</em>.</p>
100 <p>Number of bins value along with the smallest and largest value for a column is used to
101 group the column values into different groups.</p>
102 <p>The bin range list is used to group values for a column into different groups; it must contain
103 values in ascending order. Examples:</p>
104 <div class="OptionsBox">
105 10,20,30
106 <br/> 0.1,0.2,0.3,0.4,0.5</div>
107 <p>The frequency value calculated for a specific bin corresponds to all the column values
108 which are greater than the previous bin value and less than or equal to the current bin value.</p>
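<p>The binning rule above (greater than the previous bin value, less than or equal to the current one) can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's implementation: the function names are hypothetical, the first bin is assumed to have no lower bound, values above the last bin value are assumed to be left uncounted, and the equal-width layout for a bin count is an assumption.</p>

```python
from bisect import bisect_left


def frequency(values, bin_edges):
    """Count values per bin; bin_edges must be in ascending order."""
    counts = [0] * len(bin_edges)
    for value in values:
        # bisect_left gives the index of the first edge that is not below
        # the value, i.e. the bin whose upper edge covers it.
        index = bisect_left(bin_edges, value)
        if index != len(bin_edges):  # values above the last edge are skipped
            counts[index] += 1
    return dict(zip(bin_edges, counts))


def edges_from_bin_count(values, bin_count):
    """Assumed layout: equal-width bins from the smallest to the largest value."""
    low, high = min(values), max(values)
    width = (high - low) / bin_count
    return [low + width * (i + 1) for i in range(bin_count - 1)] + [high]


data = [5, 10, 15, 20, 25, 30, 35]
print(frequency(data, [10, 20, 30]))
print(frequency(data, edges_from_bin_count(data, 3)))
```

<p>In the first call, 10 falls in the bin labeled 10 and 35 is not counted because it exceeds the last bin value.</p>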
109 </dd>
110 <dt><strong><strong>-h, --help</strong></strong></dt>
111 <dd>
112 <p>Print this help message.</p>
113 </dd>
114 <dt><strong><strong>--indelim</strong> <em>comma | semicolon</em></strong></dt>
115 <dd>
116 <p>Input delimiter for CSV <em>TextFile(s)</em>. Possible values: <em>comma or semicolon</em>.
117 Default value: <em>comma</em>. For TSV files, this option is ignored and <em>tab</em> is used as a
118 delimiter.</p>
119 </dd>
120 <dt><strong><strong>--klargest</strong> <em>number</em></strong></dt>
121 <dd>
122 <p>Kth largest value to find by <em>KLargest</em> function. Default value: <em>2</em>. Valid values: positive
123 integers.</p>
124 </dd>
125 <dt><strong><strong>--ksmallest</strong> <em>number</em></strong></dt>
126 <dd>
127 <p>Kth smallest value to find by <em>KSmallest</em> function. Default value: <em>2</em>. Valid values: positive
128 integers.</p>
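<p>A minimal Python sketch of what <em>KLargest</em> and <em>KSmallest</em> select, assuming that duplicate values each occupy their own rank (as in spreadsheet LARGE and SMALL functions). The function names are hypothetical and the script's treatment of duplicates may differ.</p>

```python
def k_largest(values, k):
    """Return the k-th largest value (k = 1 is the maximum)."""
    return sorted(values, reverse=True)[k - 1]


def k_smallest(values, k):
    """Return the k-th smallest value (k = 1 is the minimum)."""
    return sorted(values)[k - 1]


data = [4, 9, 1, 7]
print(k_largest(data, 2), k_smallest(data, 2))
```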
129 </dd>
130 <dt><strong><strong>-m, --mode</strong> <em>DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | &quot;function1, [function2,...]&quot;</em></strong></dt>
131 <dd>
132 <p>Specify how to analyze data in TextFile(s): calculate basic or all descriptive statistics; or
133 use a comma delimited list of supported statistical functions. Possible values:
134 <em>DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | &quot;function1,[function2]...&quot;</em>. Default
135 value: <em>DescriptiveStatisticsBasic</em>.</p>
136 <p><em>DescriptiveStatisticsBasic</em> includes these functions: <em>Count, Maximum, Minimum, Mean,
137 Median, Sum, StandardDeviation, StandardError, Variance</em>.</p>
138 <p><em>DescriptiveStatisticsAll</em>, in addition to <em>DescriptiveStatisticsBasic</em> functions, includes:
139 <em>GeometricMean, Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Mode, RSquare,
140 Skewness, TrimMean</em>.</p>
141 <p><em>All</em> uses the complete list of supported functions: <em>Average, AverageDeviation, Correlation,
142 Count, Covariance, GeometricMean, Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis,
143 Maximum, Minimum, Mean, Median, Mode, RSquare, Skewness, Sum,
144 SumOfSquares, StandardDeviation, StandardDeviationN, StandardError, StandardScores,
145 StandardScoresN, TrimMean, Variance, VarianceN</em>. The function names ending with N
146 calculate corresponding values assuming an entire population instead of a population sample.</p>
147 <p>Here are the formulas for these functions:</p>
148 <p>Average: See Mean</p>
149 <p>AverageDeviation: SUM( ABS(x[i] - Xmean) ) / n</p>
150 <p>Correlation: See Pearson Correlation</p>
151 <p>Covariance: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / n</p>
152 <p>GeometricMean: NthROOT( PRODUCT(x[i]) )</p>
153 <p>HarmonicMean: 1 / ( SUM(1/x[i]) / n )</p>
154 <p>Mean: SUM( x[i] ) / n</p>
155 <p>Median: Xsorted[(n - 1)/2 + 1] for odd values of n; (Xsorted[n/2] + Xsorted[n/2 + 1])/2
156 for even values of n.</p>
157 <p>Kurtosis: [ {n(n + 1)/(n - 1)(n - 2)(n - 3)} SUM{ ((x[i] - Xmean)/STDDEV)^4 } ] -
158 {3((n - 1)^2)}/{(n - 2)(n-3)}</p>
159 <p>PearsonCorrelation: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / SQRT( SUM( (x[i] - Xmean)^2 )
160 (SUM( (y[i] - Ymean)^2 )) )</p>
161 <p>RSquare: PearsonCorrelation^2</p>
162 <p>Skewness: {n/(n - 1)(n - 2)} SUM{ ((x[i] - Xmean)/STDDEV)^3 }</p>
163 <p>StandardDeviation: SQRT ( SUM( (x[i] - Mean)^2 ) / (n - 1) )</p>
164 <p>StandardDeviationN: SQRT ( SUM( (x[i] - Mean)^2 ) / n )</p>
165 <p>StandardError: StandardDeviation / SQRT( n )</p>
166 <p>StandardScores: (x[i] - Mean) / StandardDeviation</p>
167 <p>StandardScoresN: (x[i] - Mean) / StandardDeviationN</p>
168 <p>Variance: SUM( (x[i] - Xmean)^2 ) / (n - 1)</p>
169 <p>VarianceN: SUM( (x[i] - Xmean)^2 ) / n</p>
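<p>For readers who want to check a value by hand, several of the formulas above can be transcribed into Python as shown below. This is an independent sketch rather than the script's code, and all function names are hypothetical. The median takes the middle sorted value for odd n and the average of the two middle values for even n; the <em>population</em> flag switches between the (n - 1) and n divisors of the functions with and without the N suffix.</p>

```python
import math


def mean(x):
    return sum(x) / len(x)


def median(x):
    xs, n = sorted(x), len(x)
    mid = n // 2
    # Odd n: single middle value; even n: average of the two middle values.
    return xs[mid] if n % 2 else (xs[mid - 1] + xs[mid]) / 2


def variance(x, population=False):
    m = mean(x)
    divisor = len(x) if population else len(x) - 1
    return sum((v - m) ** 2 for v in x) / divisor


def standard_deviation(x, population=False):
    return math.sqrt(variance(x, population))


def standard_error(x):
    return standard_deviation(x) / math.sqrt(len(x))


def covariance(x, y):
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def pearson_correlation(x, y):
    mx, my = mean(x), mean(y)
    numerator = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sum_sq_x = sum((a - mx) ** 2 for a in x)
    sum_sq_y = sum((b - my) ** 2 for b in y)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)


def skewness(x):
    n, m, s = len(x), mean(x), standard_deviation(x)
    return n / ((n - 1) * (n - 2)) * sum(((v - m) / s) ** 3 for v in x)


def kurtosis(x):
    n, m, s = len(x), mean(x), standard_deviation(x)
    lead = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    tail = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return lead * sum(((v - m) / s) ** 4 for v in x) - tail


x = [2, 4, 4, 4, 5, 5, 7, 9]
y = [2 * v + 1 for v in x]
print(mean(x), median(x), variance(x, population=True))
print(pearson_correlation(x, y) ** 2)  # RSquare
```

<p>RSquare is the square of the Pearson correlation, so it is obtained from <em>pearson_correlation</em> directly.</p>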
170 </dd>
171 <dt><strong><strong>-o, --overwrite</strong></strong></dt>
172 <dd>
173 <p>Overwrite existing files.</p>
174 </dd>
175 <dt><strong><strong>--outdelim</strong> <em>comma | tab | semicolon</em></strong></dt>
176 <dd>
177 <p>Output text file delimiter. Possible values: <em>comma, tab, or semicolon</em>.
178 Default value: <em>comma</em>.</p>
179 </dd>
180 <dt><strong><strong>-p, --precision</strong> <em>number</em></strong></dt>
181 <dd>
182 <p>Precision of calculated values in the output file. Default: up to <em>2</em> decimal places.
183 Valid values: positive integers.</p>
184 </dd>
185 <dt><strong><strong>-q, --quote</strong> <em>yes | no</em></strong></dt>
186 <dd>
187 <p>Put quotes around column values in output text file. Possible values: <em>yes or
188 no</em>. Default value: <em>yes</em>.</p>
189 </dd>
190 <dt><strong><strong>-r, --root</strong> <em>rootname</em></strong></dt>
191 <dd>
192 <p>New text file name is generated using the root: &lt;Root&gt;.&lt;Ext&gt;. Default new file
193 name: &lt;InitialTextFileName&gt;&lt;Mode&gt;.&lt;Ext&gt;. Based on the specified analysis,
194 &lt;Mode&gt; corresponds to one of these values: DescriptiveStatisticsBasic,
195 DescriptiveStatisticsAll, AllStatistics, SpecifiedStatistics, Covariance, Correlation,
196 Frequency, or StandardScores. The csv and tsv &lt;Ext&gt; values are used for
197 comma/semicolon and tab delimited text files respectively. This option is ignored for
198 multiple input files.</p>
199 </dd>
200 <dt><strong><strong>--trimfraction</strong> <em>number</em></strong></dt>
201 <dd>
202 <p>Fraction of data to exclude from the top and bottom of the data set during
203 <em>TrimMean</em> calculation. Default value: <em>0.1</em>. Valid values: &gt; 0 and &lt; 1.</p>
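<p>One possible reading of <em>TrimMean</em> is sketched below in Python. It assumes, in the manner of spreadsheet TRIMMEAN functions, that the fraction is the total share of values removed, split evenly between the two ends and rounded down per end. The script may instead apply the fraction to each end separately, so treat this as an illustration only; the function name <em>trim_mean</em> is hypothetical.</p>

```python
def trim_mean(values, trim_fraction=0.1):
    """Mean of the values left after trimming both ends of the sorted data."""
    ordered = sorted(values)
    # Assumption: trim_fraction is the total share removed, split evenly
    # between both ends and rounded down to a whole number per end.
    per_end = int(len(ordered) * trim_fraction / 2)
    kept = ordered[per_end:len(ordered) - per_end]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trim_mean(data, 0.2))  # drops 1 and 100 before averaging
```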
204 </dd>
205 <dt><strong><strong>-w, --workingdir</strong> <em>dirname</em></strong></dt>
206 <dd>
207 <p>Location of working directory. Default: current directory.</p>
208 </dd>
209 </dl>
210 <p>
211 </p>
212 <h2>EXAMPLES</h2>
213 <p>To calculate basic statistics for data in first column and generate a
214 NewSample1DescriptiveStatisticsBasic.csv file, type:</p>
215 <div class="ExampleBox">
216 % AnalyzeTextFilesData.pl -o -r NewSample1 Sample1.csv</div>
217 <p>To calculate basic statistics for data in third column and generate a
218 NewSample1DescriptiveStatisticsBasic.csv file, type:</p>
219 <div class="ExampleBox">
220 % AnalyzeTextFilesData.pl --columns 3 -o -r NewSample1 Sample1.csv</div>
221 <p>To calculate basic statistics for data in MolWeight column and generate a
222 NewSample1DescriptiveStatisticsBasic.csv file, type:</p>
223 <div class="ExampleBox">
224 % AnalyzeTextFilesData.pl --colmode collabel --columns MolWeight -o
225 -r NewSample1 Sample1.csv</div>
226 <p>To calculate all available statistics for data in third column and all column pairs,
227 and generate NewSample1DescriptiveStatisticsAll.csv, NewSample1CorrelationMatrix.csv,
228 NewSample1CorrelationMatrix.csv, and NewSample1MolWeightFrequencyAnalysis.csv files,
229 type:</p>
230 <div class="ExampleBox">
231 % AnalyzeTextFilesData.pl -m DescriptiveStatisticsAll --columns 3 -o
232 --columnpairs AllPairs -r NewSample1 Sample1.csv</div>
233 <p>To compute frequency distribution of data in third column into five bins and
234 generate NewSample1MolWeightFrequencyAnalysis.csv, type:</p>
235 <div class="ExampleBox">
236 % AnalyzeTextFilesData.pl -m Frequency --frequencybins 5 --columns 3
237 -o -r NewSample1 Sample1.csv</div>
238 <p>To compute frequency distribution of data in third column into specified bin range
239 values, and generate NewSample1MolWeightFrequencyAnalysis.csv, type:</p>
240 <div class="ExampleBox">
241 % AnalyzeTextFilesData.pl -m Frequency --frequencybins &quot;100,200,400&quot;
242 --columns 3 -o -r NewSample1 Sample1.csv</div>
243 <p>To calculate all available statistics for data in all columns and column pairs, type:</p>
244 <div class="ExampleBox">
245 % AnalyzeTextFilesData.pl -m All --columns All --columnpairs
246 AllPairs -o -r NewSample1 Sample1.csv</div>
247 <p>
248 </p>
249 <h2>AUTHOR</h2>
250 <p><a href="mailto:msud@san.rr.com">Manish Sud</a></p>
251 <p>
252 </p>
253 <h2>SEE ALSO</h2>
254 <p><a href="./JoinTextFiles.html">JoinTextFiles.pl</a>,&nbsp;<a href="./MergeTextFilesWithSD.html">MergeTextFilesWithSD.pl</a>,&nbsp;<a href="./ModifyTextFilesFormat.html">ModifyTextFilesFormat.pl</a>,&nbsp;<a href="./SplitTextFiles.html">SplitTextFiles.pl</a>,&nbsp;<a href="./TextFilesToHTML.html">TextFilesToHTML.pl</a>
255 </p>
256 <p>
257 </p>
258 <h2>COPYRIGHT</h2>
259 <p>Copyright (C) 2015 Manish Sud. All rights reserved.</p>
260 <p>This file is part of MayaChemTools.</p>
261 <p>MayaChemTools is free software; you can redistribute it and/or modify it under
262 the terms of the GNU Lesser General Public License as published by the Free
263 Software Foundation; either version 3 of the License, or (at your option)
264 any later version.</p>
265 <p>&nbsp;</p><p>&nbsp;</p><div class="DocNav">
266 <table width="100%" border=0 cellpadding=0 cellspacing=2>
267 <tr align="left" valign="top"><td width="33%" align="left"><a href="./AnalyzeSequenceFilesData.html" title="AnalyzeSequenceFilesData.html">Previous</a>&nbsp;&nbsp;<a href="./index.html" title="Table of Contents">TOC</a>&nbsp;&nbsp;<a href="./AtomNeighborhoodsFingerprints.html" title="AtomNeighborhoodsFingerprints.html">Next</a></td><td width="34%" align="middle"><strong>March 29, 2015</strong></td><td width="33%" align="right"><strong>AnalyzeTextFilesData.pl</strong></td></tr>
268 </table>
269 </div>
270 <br />
271 <center>
272 <img src="../../images/h2o2.png">
273 </center>
274 </body>
275 </html>