1 <html>
|
|
2 <head>
|
|
3 <title>MayaChemTools:Documentation:AnalyzeTextFilesData.pl</title>
|
|
4 <meta http-equiv="content-type" content="text/html;charset=utf-8">
|
|
5 <link rel="stylesheet" type="text/css" href="../../css/MayaChemTools.css">
|
|
6 </head>
|
|
7 <body leftmargin="20" rightmargin="20" topmargin="10" bottommargin="10">
|
|
8 <br/>
|
|
9 <center>
|
|
10 <a href="http://www.mayachemtools.org" title="MayaChemTools Home"><img src="../../images/MayaChemToolsLogo.gif" border="0" alt="MayaChemTools"></a>
|
|
11 </center>
|
|
12 <br/>
|
|
13 <div class="DocNav">
|
|
14 <table width="100%" border=0 cellpadding=0 cellspacing=2>
|
|
15 <tr align="left" valign="top"><td width="33%" align="left"><a href="./AnalyzeSequenceFilesData.html" title="AnalyzeSequenceFilesData.html">Previous</a> <a href="./index.html" title="Table of Contents">TOC</a> <a href="./AtomNeighborhoodsFingerprints.html" title="AtomNeighborhoodsFingerprints.html">Next</a></td><td width="34%" align="middle"><strong>AnalyzeTextFilesData.pl</strong></td><td width="33%" align="right"><a href="././code/AnalyzeTextFilesData.html" title="View source code">Code</a> | <a href="./../pdf/AnalyzeTextFilesData.pdf" title="PDF US Letter Size">PDF</a> | <a href="./../pdfgreen/AnalyzeTextFilesData.pdf" title="PDF US Letter Size with narrow margins: www.changethemargins.com">PDFGreen</a> | <a href="./../pdfa4/AnalyzeTextFilesData.pdf" title="PDF A4 Size">PDFA4</a> | <a href="./../pdfa4green/AnalyzeTextFilesData.pdf" title="PDF A4 Size with narrow margins: www.changethemargins.com">PDFA4Green</a></td></tr>
|
|
16 </table>
|
|
17 </div>
|
|
18 <p>
|
|
19 </p>
|
|
20 <h2>NAME</h2>
|
|
<p>AnalyzeTextFilesData.pl - Analyze numerical column data in TextFile(s)</p>
|
|
22 <p>
|
|
23 </p>
|
|
24 <h2>SYNOPSIS</h2>
|
|
25 <p>AnalyzeTextFilesData.pl TextFile(s)...</p>
|
|
26 <p>AnalyzeTextFilesData.pl [<strong>-c, --colmode</strong> colnum | collabel] [<strong>--columns</strong> "colnum,[colnum,...]" | "collabel,[collabel,...]" | All]
|
|
27 [<strong>--columnpairs</strong> "colnum,colnum,[colnum,colnum]..." | "collabel,collabel,[collabel,collabel]..." | AllPairs]
|
|
28 [<strong>-d, --detail</strong> infolevel] [<strong>-f, --fast</strong>] [<strong>--frequencybins</strong> number | "number,number,[number,...]"] [<strong>-h, --help</strong>]
|
|
29 [<strong>--indelim</strong> comma | semicolon] [<strong>--klargest</strong> number] [<strong>--ksmallest</strong> number]
|
|
30 [<strong>-m, --mode</strong> DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | "function1, [function2,...]"]
|
|
31 [<strong>-o, --overwrite</strong>] [<strong>--outdelim</strong> comma | tab | semicolon] [<strong>-p, --precision</strong> number]
|
|
[<strong>-q, --quote</strong> yes | no] [<strong>-r, --root</strong> rootname] [<strong>--trimfraction</strong> number] [<strong>-w, --workingdir</strong> dirname] TextFile(s)...</p>
|
|
33 <p>
|
|
34 </p>
|
|
35 <h2>DESCRIPTION</h2>
|
|
<p>Analyze numerical column data in <em>TextFile(s)</em> using a combination of various statistical
functions; non-numerical values are simply ignored. For <em>Correlation, RSquare, and Covariance</em>
analysis, the count of valid values in both columns of a specified column pair must be the same; otherwise, the column
pair is ignored. The file names are separated by spaces. The valid file extensions are <em>.csv</em>
and <em>.tsv</em> for comma/semicolon and tab delimited text files respectively. All other
file names are ignored. All the text files in the current directory can be specified by
<em>*.csv</em>, <em>*.tsv</em>, or the current directory name. The <strong>--indelim</strong> option determines
the format of <em>TextFile(s)</em>. Any file which doesn't correspond to the format indicated
by the <strong>--indelim</strong> option is ignored.</p>
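<p>As a rough illustration of the two rules above (non-numerical values are ignored, and a column pair is used only when both columns have the same count of valid values), here is a minimal Python sketch. The helper names <em>numeric_values</em> and <em>pair_is_usable</em> are hypothetical, and this is not the script's actual Perl implementation; its exact definition of a numerical value may differ.</p>

```python
def numeric_values(cells):
    """Return the cells that parse as floats; skip everything else."""
    kept = []
    for cell in cells:
        try:
            kept.append(float(cell))
        except ValueError:
            # Non-numerical cell (label, blank, "NA", ...): ignored.
            continue
    return kept


def pair_is_usable(col_x, col_y):
    """A column pair is analyzed only if both columns have equal valid counts."""
    return len(numeric_values(col_x)) == len(numeric_values(col_y))


column = ["1.5", "NA", "2", "", "abc", "4.25"]
print(numeric_values(column))
print(pair_is_usable(["1", "2"], ["3", "NA"]))
```

<p>In this sketch the second call reports that the pair would be skipped, because one column has two valid values and the other has only one.</p>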
|
|
45 <p>
|
|
46 </p>
|
|
47 <h2>OPTIONS</h2>
|
|
48 <dl>
|
|
49 <dt><strong><strong>-c, --colmode</strong> <em>colnum | collabel</em></strong></dt>
|
|
50 <dd>
|
|
51 <p>Specify how columns are identified in TextFile(s): using column number or column
|
|
52 label. Possible values: <em>colnum or collabel</em>. Default value: <em>colnum</em>.</p>
|
|
53 </dd>
|
|
<dt><strong><strong>--columns</strong> <em>"colnum,[colnum,...]" | "collabel,[collabel,...]" | All</em></strong></dt>
|
|
55 <dd>
|
|
56 <p>This value is mode specific. It's a list of comma delimited columns to use
|
|
57 for data analysis. Default value: <em>First column</em>.</p>
|
|
<p>This value is ignored during <em>Correlation/Pearson Correlation</em> and <em>Covariance</em>
data analysis; the <strong>--columnpairs</strong> option is used instead.</p>
|
|
60 <p>For <em>colnum</em> value of <strong>-c, --colmode</strong> option, input values format is:
|
|
61 <em>colnum,colnum,...</em>. Example:</p>
|
|
62 <div class="OptionsBox">
|
|
63 1,3,5</div>
|
|
64 <p>For <em>collabel</em> value of <strong>-c, --colmode</strong> option, input values format is:
|
|
<em>collabel,collabel,...</em>. Example:</p>
|
|
66 <div class="OptionsBox">
|
|
67 ALogP,MolWeight,EC50</div>
|
|
68 </dd>
|
|
69 <dt><strong><strong>--columnpairs</strong> <em>"colnum,colnum,[colnum,colnum,...]" | "collabel,collabel,[collabel,collabel,...]" | AllPairs</em></strong></dt>
|
|
70 <dd>
|
|
71 <p>This value is mode specific and is only used for <em>Correlation, PearsonCorrelation, or
|
|
72 Covariance</em> value of <strong>-m, --mode</strong> option. It is a comma delimited list of column pairs
|
|
73 to use for data analysis during <em>Correlation</em> and <em>Covariance</em> calculations. Default value:
|
|
74 <em>First column, Second column</em>.</p>
|
|
75 <p>For <em>colnum</em> value of <strong>-c, --colmode</strong> option, input values format is:
|
|
76 <em>colnum,colnum,[colnum,colnum]...</em>. Example:</p>
|
|
77 <div class="OptionsBox">
|
|
78 1,3,5,6,1,6</div>
|
|
79 <p>For <em>collabel</em> value of <strong>-c, --colmode</strong> option, input values format is:
|
|
<em>collabel,collabel,[collabel,collabel]...</em>. Example:</p>
|
|
81 <div class="OptionsBox">
|
|
82 MolWeight,EC50,NumN+O,PSA</div>
|
|
<p>For <em>AllPairs</em> value of <strong>--columnpairs</strong> option, all column pairs are used for <em>Correlation</em>
and <em>Covariance</em> calculations.</p>
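<p>The flat, comma delimited pair list described above can be read two entries at a time. The following Python sketch shows one way to do that; the function name <em>parse_column_pairs</em> is hypothetical and the script's own parsing and validation may differ.</p>

```python
def parse_column_pairs(spec):
    """Split a flat "a,b,c,d" specification into [("a", "b"), ("c", "d")]."""
    items = [item.strip() for item in spec.split(",")]
    if len(items) % 2 != 0:
        # Assumption: an odd number of entries cannot form complete pairs.
        raise ValueError("column pairs need an even number of entries")
    # Even positions are the first members, odd positions the second members.
    return list(zip(items[0::2], items[1::2]))


print(parse_column_pairs("1,3,5,6,1,6"))
print(parse_column_pairs("MolWeight,EC50,NumN+O,PSA"))
```

<p>With the two examples from this section, the sketch yields the pairs (1, 3), (5, 6), (1, 6) and (MolWeight, EC50), (NumN+O, PSA).</p>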
|
|
85 </dd>
|
|
86 <dt><strong><strong>-d, --detail</strong> <em>infolevel</em></strong></dt>
|
|
87 <dd>
|
|
88 <p>Level of information to print about column values being ignored. Default: <em>1</em>. Possible values:
|
|
89 1, 2, 3, or 4.</p>
|
|
90 </dd>
|
|
91 <dt><strong><strong>-f, --fast</strong></strong></dt>
|
|
92 <dd>
|
|
<p>In this mode, all the columns specified for analysis are assumed to contain numerical
data and no checking is performed before analysis. By default, only numerical data is
used for analysis; other types of column data are ignored.</p>
|
|
96 </dd>
|
|
97 <dt><strong><strong>--frequencybins</strong> <em>number | "number,number,[number,...]"</em></strong></dt>
|
|
98 <dd>
|
|
<p>Specify number of bins or bin range to use for frequency analysis. Default value: <em>10</em>.</p>
<p>The number of bins, along with the smallest and largest values for a column, is used to
group the column values into different groups.</p>
<p>The bin range list is used to group values for a column into different groups; it must contain
values in ascending order. Examples:</p>
|
|
104 <div class="OptionsBox">
|
|
105 10,20,30
|
|
106 <br/> 0.1,0.2,0.3,0.4,0.5</div>
|
|
107 <p>The frequency value calculated for a specific bin corresponds to all the column values
|
|
108 which are greater than the previous bin value and less than or equal to the current bin value.</p>
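<p>The binning rule above (a value belongs to a bin when it is greater than the previous bin value and less than or equal to the current one) can be sketched in Python as follows. The function names are hypothetical. Two points are assumptions, not documented behavior: values above the last bin value are left uncounted, and a plain bin count is turned into equal-width bin values between the column minimum and maximum.</p>

```python
from bisect import bisect_left


def bin_frequencies(values, bin_values):
    """Count values per bin; bin i holds previous bin value .. bin_values[i], upper end inclusive."""
    counts = [0] * len(bin_values)
    for value in values:
        # bisect_left returns the first bin whose value is >= the data value.
        index = bisect_left(bin_values, value)
        if index == len(bin_values):
            continue  # Assumption: values above the last bin value are not counted.
        counts[index] += 1
    return counts


def equal_width_bin_values(values, number_of_bins):
    """Assumption: a bin count means equal-width bins spanning min..max of the column."""
    low, high = min(values), max(values)
    width = (high - low) / number_of_bins
    edges = [low + width * (i + 1) for i in range(number_of_bins)]
    edges[-1] = high  # Guard against floating-point drift on the last bin value.
    return edges


data = [5, 10, 11, 20, 25, 30, 31]
print(bin_frequencies(data, [10, 20, 30]))
```

<p>For the bin range <em>10,20,30</em>, the values 5 and 10 fall in the first bin, 11 and 20 in the second, and 25 and 30 in the third; 31 is outside the range in this sketch.</p>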
|
|
109 </dd>
|
|
110 <dt><strong><strong>-h, --help</strong></strong></dt>
|
|
111 <dd>
|
|
112 <p>Print this help message.</p>
|
|
113 </dd>
|
|
114 <dt><strong><strong>--indelim</strong> <em>comma | semicolon</em></strong></dt>
|
|
115 <dd>
|
|
116 <p>Input delimiter for CSV <em>TextFile(s)</em>. Possible values: <em>comma or semicolon</em>.
|
|
117 Default value: <em>comma</em>. For TSV files, this option is ignored and <em>tab</em> is used as a
|
|
118 delimiter.</p>
|
|
119 </dd>
|
|
120 <dt><strong><strong>--klargest</strong> <em>number</em></strong></dt>
|
|
121 <dd>
|
|
<p>Kth largest value to find by <em>KLargest</em> function. Default value: <em>2</em>. Valid values: positive
|
|
123 integers.</p>
|
|
124 </dd>
|
|
125 <dt><strong><strong>--ksmallest</strong> <em>number</em></strong></dt>
|
|
126 <dd>
|
|
127 <p>Kth smallest value to find by <em>KSmallest</em> function. Default value: <em>2</em>. Valid values: positive
|
|
128 integers.</p>
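<p>A minimal Python sketch of what <em>KLargest</em> and <em>KSmallest</em> compute, assuming that duplicate values each occupy their own rank (the script's handling of ties is not stated here). The function names are hypothetical.</p>

```python
def k_largest(values, k):
    """Return the kth largest value (k = 1 is the maximum)."""
    if k not in range(1, len(values) + 1):
        raise ValueError("k must be a positive integer no larger than the value count")
    return sorted(values, reverse=True)[k - 1]


def k_smallest(values, k):
    """Return the kth smallest value (k = 1 is the minimum)."""
    if k not in range(1, len(values) + 1):
        raise ValueError("k must be a positive integer no larger than the value count")
    return sorted(values)[k - 1]


data = [7, 3, 9, 1]
print(k_largest(data, 2), k_smallest(data, 2))
```

<p>With the default value of 2 for both options, the sketch returns the second largest and second smallest values of the column.</p>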
|
|
129 </dd>
|
|
130 <dt><strong><strong>-m, --mode</strong> <em>DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | "function1, [function2,...]"</em></strong></dt>
|
|
131 <dd>
|
|
132 <p>Specify how to analyze data in TextFile(s): calculate basic or all descriptive statistics; or
|
|
133 use a comma delimited list of supported statistical functions. Possible values:
|
|
<em>DescriptiveStatisticsBasic | DescriptiveStatisticsAll | All | "function1,[function2]..."</em>. Default
value: <em>DescriptiveStatisticsBasic</em>.</p>
|
|
136 <p><em>DescriptiveStatisticsBasic</em> includes these functions: <em>Count, Maximum, Minimum, Mean,
|
|
137 Median, Sum, StandardDeviation, StandardError, Variance</em>.</p>
|
|
138 <p><em>DescriptiveStatisticsAll</em>, in addition to <em>DescriptiveStatisticsBasic</em> functions, includes:
|
|
139 <em>GeometricMean, Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis, Mode, RSquare,
|
|
140 Skewness, TrimMean</em>.</p>
|
|
141 <p><em>All</em> uses complete list of supported functions: <em>Average, AverageDeviation, Correlation,
|
|
142 Count, Covariance, GeometricMean, Frequency, HarmonicMean, KLargest, KSmallest, Kurtosis,
|
|
143 Maximum, Minimum, Mean, Median, Mode, RSquare, Skewness, Sum,
|
|
144 SumOfSquares, StandardDeviation, StandardDeviationN, StandardError, StandardScores,
|
|
145 StandardScoresN, TrimMean, Variance, VarianceN</em>. The function names ending with N
|
|
146 calculate corresponding values assuming an entire population instead of a population sample.</p>
|
|
147 <p>Here are the formulas for these functions:</p>
|
|
148 <p>Average: See Mean</p>
|
|
149 <p>AverageDeviation: SUM( ABS(x[i] - Xmean) ) / n</p>
|
|
150 <p>Correlation: See Pearson Correlation</p>
|
|
151 <p>Covariance: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / n</p>
|
|
152 <p>GeometricMean: NthROOT( PRODUCT(x[i]) )</p>
|
|
153 <p>HarmonicMean: 1 / ( SUM(1/x[i]) / n )</p>
|
|
154 <p>Mean: SUM( x[i] ) / n</p>
|
|
<p>Median: Xsorted[(n - 1)/2 + 1] for odd values of n; (Xsorted[n/2] + Xsorted[n/2 + 1])/2
for even values of n.</p>
|
|
<p>Kurtosis: [ {n(n + 1)/((n - 1)(n - 2)(n - 3))} SUM{ ((x[i] - Xmean)/STDDEV)^4 } ] -
{3((n - 1)^2)}/{(n - 2)(n - 3)}</p>
|
|
159 <p>PearsonCorrelation: SUM( (x[i] - Xmean)(y[i] - Ymean) ) / SQRT( SUM( (x[i] - Xmean)^2 )
|
|
160 (SUM( (y[i] - Ymean)^2 )) )</p>
|
|
161 <p>RSquare: PearsonCorrelation^2</p>
|
|
<p>Skewness: {n/((n - 1)(n - 2))} SUM{ ((x[i] - Xmean)/STDDEV)^3 }</p>
|
|
163 <p>StandardDeviation: SQRT ( SUM( (x[i] - Mean)^2 ) / (n - 1) )</p>
|
|
164 <p>StandardDeviationN: SQRT ( SUM( (x[i] - Mean)^2 ) / n )</p>
|
|
165 <p>StandardError: StandardDeviation / SQRT( n )</p>
|
|
<p>StandardScores: (x[i] - Mean) / StandardDeviation</p>
<p>StandardScoresN: (x[i] - Mean) / StandardDeviationN</p>
|
|
168 <p>Variance: SUM( (x[i] - Xmean)^2 / (n - 1) )</p>
|
|
169 <p>VarianceN: SUM( (x[i] - Xmean)^2 / n )</p>
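<p>Several of the formulas listed above translate directly into code. The Python sketch below follows the listed definitions for Mean, AverageDeviation, Variance, VarianceN, StandardDeviation, StandardDeviationN, StandardError, Covariance, PearsonCorrelation, and RSquare. The function names and the <em>population</em> flag are illustrative choices, not the script's Perl API.</p>

```python
import math


def mean(xs):
    return sum(xs) / len(xs)


def average_deviation(xs):
    m = mean(xs)
    return sum(abs(x - m) for x in xs) / len(xs)


def variance(xs, population=False):
    """Variance divides by n - 1; VarianceN (population=True) divides by n."""
    m = mean(xs)
    divisor = len(xs) if population else len(xs) - 1
    return sum((x - m) ** 2 for x in xs) / divisor


def standard_deviation(xs, population=False):
    return math.sqrt(variance(xs, population))


def standard_error(xs):
    return standard_deviation(xs) / math.sqrt(len(xs))


def covariance(xs, ys):
    """Covariance as listed above: the sum of products is divided by n."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson_correlation(xs, ys):
    mx, my = mean(xs), mean(ys)
    numerator = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sum_sq_x = sum((x - mx) ** 2 for x in xs)
    sum_sq_y = sum((y - my) ** 2 for y in ys)
    return numerator / math.sqrt(sum_sq_x * sum_sq_y)


def r_square(xs, ys):
    return pearson_correlation(xs, ys) ** 2


xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
ys = [2 * x + 1 for x in xs]
print(mean(xs), variance(xs, population=True), standard_deviation(xs, population=True))
print(covariance(xs, ys), r_square(xs, ys))
```

<p>The sample data has a mean of 5 and a sum of squared deviations of 32, so VarianceN is 4 and StandardDeviationN is 2, while Variance is 32/7. Because <em>ys</em> is an exact linear function of <em>xs</em>, the correlation is 1 up to floating-point rounding.</p>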
|
|
170 </dd>
|
|
171 <dt><strong><strong>-o, --overwrite</strong></strong></dt>
|
|
172 <dd>
|
|
173 <p>Overwrite existing files.</p>
|
|
174 </dd>
|
|
175 <dt><strong><strong>--outdelim</strong> <em>comma | tab | semicolon</em></strong></dt>
|
|
176 <dd>
|
|
<p>Output text file delimiter. Possible values: <em>comma, tab, or semicolon</em>.
|
|
178 Default value: <em>comma</em>.</p>
|
|
179 </dd>
|
|
180 <dt><strong><strong>-p, --precision</strong> <em>number</em></strong></dt>
|
|
181 <dd>
|
|
182 <p>Precision of calculated values in the output file. Default: up to <em>2</em> decimal places.
|
|
183 Valid values: positive integers.</p>
|
|
184 </dd>
|
|
185 <dt><strong><strong>-q, --quote</strong> <em>yes | no</em></strong></dt>
|
|
186 <dd>
|
|
187 <p>Put quotes around column values in output text file. Possible values: <em>yes or
|
|
188 no</em>. Default value: <em>yes</em>.</p>
|
|
189 </dd>
|
|
190 <dt><strong><strong>-r, --root</strong> <em>rootname</em></strong></dt>
|
|
191 <dd>
|
|
<p>New text file name is generated using the root: &lt;Root&gt;.&lt;Ext&gt;. Default new file
name: &lt;InitialTextFileName&gt;&lt;Mode&gt;.&lt;Ext&gt;. Based on the specified analysis,
&lt;Mode&gt; corresponds to one of these values: DescriptiveStatisticsBasic,
DescriptiveStatisticsAll, AllStatistics, SpecifiedStatistics, Covariance, Correlation,
Frequency, or StandardScores. The csv and tsv &lt;Ext&gt; values are used for
comma/semicolon and tab delimited text files respectively. This option is ignored for
multiple input files.</p>
|
|
199 </dd>
|
|
200 <dt><strong><strong>--trimfraction</strong> <em>number</em></strong></dt>
|
|
201 <dd>
|
|
202 <p>Fraction of data to exclude from the top and bottom of the data set during
|
|
<em>TrimMean</em> calculation. Default value: <em>0.1</em>. Valid values: &gt; 0 and &lt; 1.</p>
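<p>A trimmed mean can be sketched in Python as shown below. This is only an illustration under a stated assumption: the fraction is treated as the total share of values to drop, split evenly between the smallest and largest values (as spreadsheet TRIMMEAN functions do). Whether the script applies the fraction in this way, or to each end separately, is not specified here. The function name <em>trim_mean</em> is hypothetical.</p>

```python
def trim_mean(values, fraction):
    """Mean after dropping a share of the smallest and largest values."""
    if fraction >= 1 or 0 >= fraction:
        raise ValueError("fraction must be greater than 0 and smaller than 1")
    ordered = sorted(values)
    # Assumption: fraction is the total share to drop, half from each end.
    total_to_drop = int(len(ordered) * fraction)
    per_end = total_to_drop // 2
    kept = ordered[per_end:len(ordered) - per_end]
    return sum(kept) / len(kept)


data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 100]
print(trim_mean(data, 0.2))
```

<p>With ten values and a fraction of 0.2, one value is dropped from each end in this sketch, so the outlier 100 no longer affects the result.</p>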
|
|
204 </dd>
|
|
<dt><strong><strong>-w, --workingdir</strong> <em>dirname</em></strong></dt>
|
|
206 <dd>
|
|
207 <p>Location of working directory. Default: current directory.</p>
|
|
208 </dd>
|
|
209 </dl>
|
|
210 <p>
|
|
211 </p>
|
|
212 <h2>EXAMPLES</h2>
|
|
213 <p>To calculate basic statistics for data in first column and generate a
|
|
214 NewSample1DescriptiveStatisticsBasic.csv file, type:</p>
|
|
215 <div class="ExampleBox">
|
|
216 % AnalyzeTextFilesData.pl -o -r NewSample1 Sample1.csv</div>
|
|
217 <p>To calculate basic statistics for data in third column and generate a
|
|
218 NewSample1DescriptiveStatisticsBasic.csv file, type:</p>
|
|
219 <div class="ExampleBox">
|
|
220 % AnalyzeTextFilesData.pl --columns 3 -o -r NewSample1 Sample1.csv</div>
|
|
221 <p>To calculate basic statistics for data in MolWeight column and generate a
|
|
222 NewSample1DescriptiveStatisticsBasic.csv file, type:</p>
|
|
223 <div class="ExampleBox">
|
|
% AnalyzeTextFilesData.pl --colmode collabel --columns MolWeight -o
|
|
225 -r NewSample1 Sample1.csv</div>
|
|
<p>To calculate all available statistics for data in third column and all column pairs,
and generate NewSample1DescriptiveStatisticsAll.csv, NewSample1CorrelationMatrix.csv,
and NewSample1MolWeightFrequencyAnalysis.csv files,
type:</p>
|
|
230 <div class="ExampleBox">
|
|
231 % AnalyzeTextFilesData.pl -m DescriptiveStatisticsAll --columns 3 -o
|
|
232 --columnpairs AllPairs -r NewSample1 Sample1.csv</div>
|
|
233 <p>To compute frequency distribution of data in third column into five bins and
|
|
234 generate NewSample1MolWeightFrequencyAnalysis.csv, type:</p>
|
|
235 <div class="ExampleBox">
|
|
236 % AnalyzeTextFilesData.pl -m Frequency --frequencybins 5 --columns 3
|
|
237 -o -r NewSample1 Sample1.csv</div>
|
|
238 <p>To compute frequency distribution of data in third column into specified bin range
|
|
239 values, and generate NewSample1MolWeightFrequencyAnalysis.csv, type:</p>
|
|
240 <div class="ExampleBox">
|
|
241 % AnalyzeTextFilesData.pl -m Frequency --frequencybins "100,200,400"
|
|
242 --columns 3 -o -r NewSample1 Sample1.csv</div>
|
|
243 <p>To calculate all available statistics for data in all columns and column pairs, type:</p>
|
|
244 <div class="ExampleBox">
|
|
245 % AnalyzeTextFilesData.pl -m All --columns All --columnpairs
|
|
246 AllPairs -o -r NewSample1 Sample1.csv</div>
|
|
247 <p>
|
|
248 </p>
|
|
249 <h2>AUTHOR</h2>
|
|
250 <p><a href="mailto:msud@san.rr.com">Manish Sud</a></p>
|
|
251 <p>
|
|
252 </p>
|
|
253 <h2>SEE ALSO</h2>
|
|
254 <p><a href="./JoinTextFiles.html">JoinTextFiles.pl</a>, <a href="./MergeTextFilesWithSD.html">MergeTextFilesWithSD.pl</a>, <a href="./ModifyTextFilesFormat.html">ModifyTextFilesFormat.pl</a>, <a href="./SplitTextFiles.html">SplitTextFiles.pl</a>, <a href="./TextFilesToHTML.html">TextFilesToHTML.pl</a>
|
|
255 </p>
|
|
256 <p>
|
|
257 </p>
|
|
258 <h2>COPYRIGHT</h2>
|
|
259 <p>Copyright (C) 2015 Manish Sud. All rights reserved.</p>
|
|
260 <p>This file is part of MayaChemTools.</p>
|
|
261 <p>MayaChemTools is free software; you can redistribute it and/or modify it under
|
|
262 the terms of the GNU Lesser General Public License as published by the Free
|
|
263 Software Foundation; either version 3 of the License, or (at your option)
|
|
264 any later version.</p>
|
|
265 <p> </p><p> </p><div class="DocNav">
|
|
266 <table width="100%" border=0 cellpadding=0 cellspacing=2>
|
|
267 <tr align="left" valign="top"><td width="33%" align="left"><a href="./AnalyzeSequenceFilesData.html" title="AnalyzeSequenceFilesData.html">Previous</a> <a href="./index.html" title="Table of Contents">TOC</a> <a href="./AtomNeighborhoodsFingerprints.html" title="AtomNeighborhoodsFingerprints.html">Next</a></td><td width="34%" align="middle"><strong>March 29, 2015</strong></td><td width="33%" align="right"><strong>AnalyzeTextFilesData.pl</strong></td></tr>
|
|
268 </table>
|
|
269 </div>
|
|
270 <br />
|
|
271 <center>
|
|
272 <img src="../../images/h2o2.png">
|
|
273 </center>
|
|
274 </body>
|
|
275 </html>
|