NAME
ExtractFromTextFiles.pl - Extract specific data from TextFile(s)

SYNOPSIS
ExtractFromTextFiles.pl TextFile(s)...

ExtractFromTextFiles.pl [-c, --colmode colnum | collabel] [--categorycol
number | string] [--columns "colnum,[colnum]..." |
"collabel,[collabel]..."] [-h, --help] [--indelim *comma | semicolon*]
[-m, --mode *columns | rows | categories*] [-o, --overwrite] [--outdelim
*comma | tab | semicolon*] [-q, --quote *yes | no*] [--rows
"colid,value,criteria..." | "colid,value..." |
"colid,mincolvalue,maxcolvalue" | "rownum,rownum,..." | colid |
"minrownum,maxrownum"] [--rowsmode rowsbycolvalue | rowsbycolvaluelist
| rowsbycolvaluerange | rowbymincolvalue | rowbymaxcolvalue | rownums |
rownumrange] [-r, --root *rootname*] [-w, --workingdir *dirname*]
TextFile(s)...

DESCRIPTION
Extract column(s)/row(s) data from *TextFile(s)* identified by column
numbers or labels, or categorize data using a specified category column.
During categorization, a summary text file is generated containing
category name and count; an additional text file containing the data is
also generated for each category. The file names are separated by
spaces. The valid file extensions are *.csv* and *.tsv* for
comma/semicolon and tab delimited text files respectively. All other
file names are ignored. All the text files in the current directory can be
specified by **.csv*, **.tsv*, or the current directory name. The
--indelim option determines the format of *TextFile(s)*. Any file which
doesn't correspond to the format indicated by the --indelim option is
ignored.

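The file selection rule described above can be sketched in Python as
follows. This is only an illustration of the documented behavior, not the
script's own Perl code; the function name input_delimiter and the
case-insensitive extension check are assumptions made for the sketch.

```python
import os

# Delimiter characters for the --indelim values named in this page.
INDELIM_CHARS = {"comma": ",", "semicolon": ";"}

def input_delimiter(file_name, indelim="comma"):
    """Return the delimiter for a file name, or None if the file is skipped."""
    # Assumption: extensions are compared case-insensitively.
    ext = os.path.splitext(file_name)[1].lower()
    if ext == ".tsv":
        return "\t"                     # --indelim is ignored for TSV files
    if ext == ".csv":
        return INDELIM_CHARS[indelim]   # comma or semicolon
    return None                         # all other file names are ignored

names = ["Sample1.csv", "Sample2.tsv", "Notes.txt"]
accepted = {n: input_delimiter(n) for n in names
            if input_delimiter(n) is not None}
```

Here accepted keeps Sample1.csv and Sample2.tsv and drops Notes.txt.
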
OPTIONS
-c, --colmode *colnum | collabel*
Specify how columns are identified in *TextFile(s)*: using column
number or column label. Possible values: *colnum or collabel*.
Default value: *colnum*.

--categorycol *number | string*
Column used to categorize data. Default value: First column.

For *colnum* value of -c, --colmode option, input value is a column
number. Example: *1*.

For *collabel* value of -c, --colmode option, input value is a
column label. Example: *Mol_ID*.

--columns *"colnum,[colnum]..." | "collabel,[collabel]..."*
List of comma delimited columns to extract. Default value: First
column.

For *colnum* value of -c, --colmode option, input values format is:
*colnum,colnum,...*. Example: *1,3,5*

For *collabel* value of -c, --colmode option, input values format
is: *collabel,collabel,...*. Example: *Mol_ID,MolWeight*

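As an illustration of how a --columns value might be resolved under the
two -c, --colmode settings, here is a minimal Python sketch. The function
extract_columns and the sample data are hypothetical, and the sketch
assumes column numbers are 1-based, as the "First column" default and the
example *1* suggest.

```python
import csv
import io

def extract_columns(text, columns, colmode="colnum", delimiter=","):
    """Return all rows (header first) restricted to the requested columns."""
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    header = rows[0]
    if colmode == "colnum":
        # Assumption: column numbers start at 1.
        indices = [int(c) - 1 for c in columns.split(",")]
    else:
        # collabel: look each label up in the first line of the file.
        indices = [header.index(label) for label in columns.split(",")]
    return [[row[i] for i in indices] for row in rows]

sample = "Mol_ID,NAME,MolWeight\n1,Aspirin,180.16\n2,Caffeine,194.19\n"
by_number = extract_columns(sample, "1,3")
by_label = extract_columns(sample, "Mol_ID,MolWeight", colmode="collabel")
```

Both calls select the same two columns from the sample data, once by
position and once by label.
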
-h, --help
Print this help message.

--indelim *comma | semicolon*
Input delimiter for CSV *TextFile(s)*. Possible values: *comma or
semicolon*. Default value: *comma*. For TSV files, this option is
ignored and *tab* is used as a delimiter.

-m, --mode *columns | rows | categories*
Specify what to extract from *TextFile(s)*. Possible values:
*columns, rows, or categories*. Default value: *columns*.

For *columns* mode, data for the columns specified by the
--columns option is extracted from *TextFile(s)* and placed into new
text files.

For *rows* mode, the rows specified by the --rowsmode and --rows
options are extracted from *TextFile(s)* and placed into new text
files.

For *categories* mode, the column specified by --categorycol is used
to categorize data, and a summary text file is generated containing
category name and count; an additional text file containing the data
is also generated for each category.

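The *categories* mode described above amounts to grouping rows by the
value in one column and counting each group. A minimal Python sketch of
that idea follows; the function categorize and the sample rows are
hypothetical, and writing the summary and per-category files is left out.

```python
def categorize(rows, category_index=0):
    """Group data rows by the value in one column.

    rows[0] is the line of column labels; category_index is 0-based here.
    Returns (summary, groups): summary is a list of (category, count)
    pairs, groups maps each category to its data rows.
    """
    data = rows[1:]
    groups = {}
    for row in data:
        groups.setdefault(row[category_index], []).append(row)
    summary = [(name, len(members)) for name, members in groups.items()]
    return summary, groups

rows = [
    ["Class", "Mol_ID"],
    ["Acid", "1"],
    ["Base", "2"],
    ["Acid", "3"],
]
summary, groups = categorize(rows)
```

In this sketch summary corresponds to the content of the summary file and
each entry of groups to one per-category file.
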
-o, --overwrite
Overwrite existing files.

--outdelim *comma | tab | semicolon*
Output text file delimiter. Possible values: *comma, tab, or
semicolon*. Default value: *comma*.

-q, --quote *yes | no*
Put quotes around column values in output text file. Possible
values: *yes or no*. Default value: *yes*.

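The combined effect of --outdelim and -q, --quote on one output line can
be sketched in Python as below. The function format_row is hypothetical,
and the sketch assumes double quotes are used and that values contain no
embedded quotes; how the script itself treats embedded quotes is not
stated on this page.

```python
# Delimiter characters for the --outdelim values named in this page.
OUTDELIM_CHARS = {"comma": ",", "tab": "\t", "semicolon": ";"}

def format_row(values, outdelim="comma", quote="yes"):
    """Join column values into one output line."""
    if quote == "yes":
        # Assumption: double quotes; embedded quotes are not handled here.
        values = ['"%s"' % v for v in values]
    return OUTDELIM_CHARS[outdelim].join(values)

default_line = format_row(["1", "Aspirin"])
tab_line = format_row(["1", "Aspirin"], outdelim="tab", quote="no")
```

With the defaults every value is quoted and comma separated; the second
call mirrors --outdelim tab --quote no.
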
-r, --root *rootname*
New file name is generated using the root: <Root>.<Ext>. Default
new file names: <TextFile>CategoriesSummary.<Ext>,
<TextFile>ExtractedColumns.<Ext>, and <TextFile>ExtractedRows.<Ext>
for *categories*, *columns*, and *rows* mode respectively, and
<TextFile>Category<CategoryName>.<Ext> for each category retrieved
from each text file. The output file type determines <Ext> value:
csv and tsv for CSV and TSV files respectively.

This option is ignored for multiple input files.

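The naming rules above can be summarized in a short Python sketch. The
function output_name is hypothetical; it assumes that <TextFile> means the
input file name without its extension and that a tab output delimiter
yields the tsv extension, as in the NewSample1.tsv example in EXAMPLES.

```python
import os

# Default name suffix for each --mode value, as listed under -r, --root.
MODE_SUFFIX = {
    "categories": "CategoriesSummary",
    "columns": "ExtractedColumns",
    "rows": "ExtractedRows",
}

def output_name(text_file, mode="columns", outdelim="comma",
                root=None, file_count=1):
    """Return the main output file name for one input file."""
    ext = "tsv" if outdelim == "tab" else "csv"
    if root and file_count == 1:
        return "%s.%s" % (root, ext)       # <Root>.<Ext>
    # --root is ignored for multiple input files.
    base = os.path.splitext(os.path.basename(text_file))[0]
    return "%s%s.%s" % (base, MODE_SUFFIX[mode], ext)

with_root = output_name("Sample1.csv", root="NewSample1")
default_rows = output_name("Sample1.csv", mode="rows", outdelim="tab")
```

The per-category files named <TextFile>Category<CategoryName>.<Ext> are
not covered by this sketch.
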
--rows *"colid,value,criteria..." | "colid,value..." |
"colid,mincolvalue,maxcolvalue" | "rownum,rownum,..." | colid |
"minrownum,maxrownum"*
This value is --rowsmode specific. In general, it's a list of comma
separated column ids and associated mode specific values. Whether
column ids are column labels or column numbers is controlled by the
-c, --colmode option.

The first line, containing column labels, is always written out, and
value comparisons assume numerical column data.

For *rowsbycolvalue* mode, input value format contains these
triplets: *colid,value,criteria...*. Possible values for criteria:
*le, ge, or eq*. Examples:

MolWt,450,le
MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le

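To make the triplet format concrete, here is a minimal Python sketch that
parses a *rowsbycolvalue* specification and tests one row against it. The
names parse_triplets and row_matches are hypothetical, and the sketch
assumes that a row must satisfy every triplet; this page does not say how
multiple triplets are combined.

```python
import operator

# Comparison for each criteria keyword; the column value is the left operand.
CRITERIA = {"le": operator.le, "ge": operator.ge, "eq": operator.eq}

def parse_triplets(spec):
    """Split 'colid,value,criteria,...' into (colid, value, compare) tuples."""
    parts = [p.strip() for p in spec.split(",")]
    if len(parts) % 3 != 0:
        raise ValueError("expected colid,value,criteria triplets")
    return [(parts[i], float(parts[i + 1]), CRITERIA[parts[i + 2]])
            for i in range(0, len(parts), 3)]

def row_matches(row, triplets):
    """row maps column labels to string values; data is treated as numeric."""
    # Assumption: all triplets must hold for the row to be kept.
    return all(compare(float(row[colid]), value)
               for colid, value, compare in triplets)

rows = [
    {"MolWt": "300", "LogP": "2"},
    {"MolWt": "500", "LogP": "2"},
    {"MolWt": "450", "LogP": "6"},
]
triplets = parse_triplets("MolWt,450,le,LogP,5,le")
kept = [r for r in rows if row_matches(r, triplets)]
```

Only the first row passes both conditions: the second fails MolWt <= 450
and the third fails LogP <= 5.
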
For *rowsbycolvaluelist* mode, input value format is:
*colid,value...*. Examples:

Mol_ID,20
Mol_ID,20,1002,1115

For *rowsbycolvaluerange* mode, input value format is:
*colid,mincolvalue,maxcolvalue*. Examples:

MolWt,100,450

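The list and range formats above can be sketched in Python as follows.
The function names are hypothetical. The sketch compares values
numerically, in line with the note that comparisons assume numerical
column data, and it assumes both range bounds are inclusive, which this
page does not state.

```python
def parse_value_list(spec):
    """'colid,value...' -> (colid, set of numeric values)."""
    colid, *values = spec.split(",")
    return colid, {float(v) for v in values}

def parse_value_range(spec):
    """'colid,mincolvalue,maxcolvalue' -> (colid, low, high)."""
    colid, low, high = spec.split(",")
    return colid, float(low), float(high)

def in_value_list(row, spec):
    colid, wanted = parse_value_list(spec)
    return float(row[colid]) in wanted

def in_value_range(row, spec):
    colid, low, high = parse_value_range(spec)
    # Assumption: both bounds are inclusive.
    return low <= float(row[colid]) <= high

rows = [
    {"Mol_ID": "20", "MolWt": "180.16"},
    {"Mol_ID": "21", "MolWt": "460.5"},
    {"Mol_ID": "1002", "MolWt": "99.0"},
]
listed = [r["Mol_ID"] for r in rows if in_value_list(r, "Mol_ID,20,1002,1115")]
ranged = [r["Mol_ID"] for r in rows if in_value_range(r, "MolWt,100,450")]
```
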
For *rowbymincolvalue, rowbymaxcolvalue* modes, input value format
is: *colid*.

For *rownums* mode, input value format is: *rownum,rownum,...*.
Default value: *2*.

For *rownumrange* mode, input value format is:
*minrownum,maxrownum*. Examples:

10,40

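The minimum/maximum and row number range modes can be illustrated with a
short Python sketch. The function names are hypothetical. The sketch
assumes that row numbers count the line of column labels as row 1 (which
would explain the default row number of *2*), that range bounds are
inclusive, and that the first matching row wins when several rows share
the extreme value.

```python
def row_by_extreme(rows, col_index, pick=min):
    """Return the label line plus the row with the smallest (or largest) value."""
    header, data = rows[0], rows[1:]
    best = pick(data, key=lambda r: float(r[col_index]))
    return [header, best]

def rows_by_number_range(rows, min_row, max_row):
    """Return the label line plus data rows numbered min_row..max_row."""
    header = rows[0]
    # Assumption: numbering starts at 1 with the label line; bounds inclusive.
    selected = [r for n, r in enumerate(rows, start=1)
                if n != 1 and min_row <= n <= max_row]
    return [header] + selected

rows = [
    ["Mol_ID", "MolWeight"],
    ["1", "180.16"],
    ["2", "194.19"],
    ["3", "151.16"],
]
lightest = row_by_extreme(rows, 1)
heaviest = row_by_extreme(rows, 1, pick=max)
first_two = rows_by_number_range(rows, 2, 3)
```

In both helpers the line of column labels is always kept, matching the
statement that the first line is always written out.
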
--rowsmode *rowsbycolvalue | rowsbycolvaluelist | rowsbycolvaluerange |
rowbymincolvalue | rowbymaxcolvalue | rownums | rownumrange*
Specify how to extract rows from *TextFile(s)*. Possible values:
*rowsbycolvalue, rowsbycolvaluelist, rowsbycolvaluerange,
rowbymincolvalue, rowbymaxcolvalue, rownums, rownumrange*. Default
value: *rownums*.

Use the --rows option to specify the criteria used to extract rows
from *TextFile(s)*.

-w, --workingdir *dirname*
Location of working directory. Default: current directory.

EXAMPLES
To extract the first column from a text file and generate a new CSV text
file NewSample1.csv, type:

% ExtractFromTextFiles.pl -r NewSample1 -o Sample1.csv

To extract columns Mol_ID, MolWeight, and NAME from Sample1.csv and
generate a new text file NewSample1.tsv with no quotes, type:

% ExtractFromTextFiles.pl -m columns -c collabel
    --columns "Mol_ID,MolWeight,NAME" --outdelim tab --quote no
    -r NewSample1 -o Sample1.csv

To extract rows with MolWeight column values less than or equal to 450
from Sample1.csv and generate a new text file NewSample1.csv, type:

% ExtractFromTextFiles.pl -m rows --rowsmode rowsbycolvalue
    -c collabel --rows MolWeight,450,le -r NewSample1
    -o Sample1.csv

To extract rows with MolWeight column values between 450 and 500 from
Sample1.csv and generate a new text file NewSample1.csv, type:

% ExtractFromTextFiles.pl -m rows --rowsmode rowsbycolvaluerange
    -c collabel --rows MolWeight,450,500 -r NewSample1
    -o Sample1.csv

To extract the row containing the minimum value for column MolWeight from
Sample1.csv and generate a new text file NewSample1.csv, type:

% ExtractFromTextFiles.pl -m rows --rowsmode rowbymincolvalue
    -c collabel --rows MolWeight -r NewSample1
    -o Sample1.csv

AUTHOR
Manish Sud <msud@san.rr.com>

SEE ALSO
JoinTextFiles.pl, MergeTextFilesWithSD.pl, ModifyTextFilesFormat.pl,
SplitTextFiles.pl

COPYRIGHT
Copyright (C) 2015 Manish Sud. All rights reserved.

This file is part of MayaChemTools.

MayaChemTools is free software; you can redistribute it and/or modify it
under the terms of the GNU Lesser General Public License as published by
the Free Software Foundation; either version 3 of the License, or (at
your option) any later version.