comparison docs/scripts/txt/ExtractFromTextFiles.txt @ 0:4816e4a8ae95 draft default tip

Uploaded
author deepakjadmin
date Wed, 20 Jan 2016 09:23:18 -0500
NAME
    ExtractFromTextFiles.pl - Extract specific data from TextFile(s)

SYNOPSIS
    ExtractFromTextFiles.pl TextFile(s)...

    ExtractFromTextFiles.pl [-c, --colmode colnum | collabel] [--categorycol
    number | string] [--columns "colnum,[colnum]..." |
    "collabel,[collabel]..."] [-h, --help] [--indelim *comma | semicolon*]
    [-m, --mode *columns | rows | categories*] [-o, --overwrite] [--outdelim
    *comma | tab | semicolon*] [-q, --quote *yes | no*] [--rows
    "colid,value,criteria..." | "colid,value..." |
    "colid,mincolvalue,maxcolvalue" | "rownum,rownum,..." | colid |
    "minrownum,maxrownum"] [--rowsmode rowsbycolvalue | rowsbycolvaluelist
    | rowsbycolvaluerange | rowbymincolvalue | rowbymaxcolvalue | rownums |
    rownumrange] [-r, --root *rootname*] [-w, --workingdir *dirname*]
    TextFile(s)...

DESCRIPTION
    Extract column(s)/row(s) data from *TextFile(s)* identified by column
    numbers or labels, or categorize data using a specified category column.
    During categorization, a summary text file is generated containing
    category names and counts; an additional text file is generated for
    each category, containing the data for that category. The file names
    are separated by spaces. The valid file extensions are *.csv* and
    *.tsv* for comma/semicolon and tab delimited text files respectively.
    All other file names are ignored. All the text files in the current
    directory can be specified by **.csv*, **.tsv*, or the current
    directory name. The --indelim option determines the format of
    *TextFile(s)*. Any file which doesn't correspond to the format
    indicated by the --indelim option is ignored.

OPTIONS
    -c, --colmode *colnum | collabel*
        Specify how columns are identified in *TextFile(s)*: using column
        number or column label. Possible values: *colnum or collabel*.
        Default value: *colnum*.

    --categorycol *number | string*
        Column used to categorize data. Default value: First column.

        For *colnum* value of -c, --colmode option, input value is a column
        number. Example: *1*.

        For *collabel* value of -c, --colmode option, input value is a
        column label. Example: *Mol_ID*.

    --columns *"colnum,[colnum]..." | "collabel,[collabel]..."*
        List of comma delimited columns to extract. Default value: First
        column.

        For *colnum* value of -c, --colmode option, input values format is:
        *colnum,colnum,...*. Example: *1,3,5*.

        For *collabel* value of -c, --colmode option, input values format
        is: *collabel,collabel,...*. Example: *Mol_ID,MolWeight*.
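
The interplay between --colmode and --columns can be illustrated with a
short Python sketch. This is not the script's own code (which is Perl); the
function names are hypothetical, the sample data is made up, and column
numbers are assumed to be 1-based, as the *1* example for the first column
suggests.

```python
def resolve_columns(header, spec, colmode="colnum"):
    """Turn a --columns style value into zero-based column indices."""
    ids = [item.strip() for item in spec.split(",")]
    if colmode == "colnum":
        # Assumption: column numbers on the command line start at 1.
        return [int(i) - 1 for i in ids]
    # collabel: look each label up in the header line.
    return [header.index(label) for label in ids]


def extract_columns(rows, spec, colmode="colnum"):
    """Keep only the requested columns; rows[0] is the header line."""
    indices = resolve_columns(rows[0], spec, colmode)
    return [[row[i] for i in indices] for row in rows]


# Made-up sample data for illustration only.
sample = [["Mol_ID", "MolWeight", "NAME"], ["1", "180.2", "aspirin"]]
print(extract_columns(sample, "1,3"))
print(extract_columns(sample, "Mol_ID,MolWeight", colmode="collabel"))
```

With *collabel*, an unknown label raises ValueError in this sketch; how the
real script reports that case is not described here.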

    -h, --help
        Print this help message.

    --indelim *comma | semicolon*
        Input delimiter for CSV *TextFile(s)*. Possible values: *comma or
        semicolon*. Default value: *comma*. For TSV files, this option is
        ignored and *tab* is used as a delimiter.
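
The rule for picking the input delimiter can be sketched in Python as
follows. The function name is hypothetical, and matching the file extension
case-insensitively is an assumption, not something stated above.

```python
from pathlib import Path

DELIMITERS = {"comma": ",", "semicolon": ";", "tab": "\t"}


def input_delimiter(file_name, indelim="comma"):
    """Return the delimiter for file_name, or None if the file is skipped."""
    ext = Path(file_name).suffix.lower()  # assumption: case-insensitive match
    if ext == ".tsv":
        return DELIMITERS["tab"]          # --indelim is ignored for TSV files
    if ext == ".csv":
        return DELIMITERS[indelim]        # comma (default) or semicolon
    return None                           # other file names are ignored


print(repr(input_delimiter("Sample1.csv")))
print(repr(input_delimiter("Sample1.tsv", "semicolon")))
```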

    -m, --mode *columns | rows | categories*
        Specify what to extract from *TextFile(s)*. Possible values:
        *columns, rows, or categories*. Default value: *columns*.

        For *columns* mode, data for the columns specified by the --columns
        option is extracted from *TextFile(s)* and placed into new text
        files.

        For *rows* mode, the rows specified by the --rowsmode and --rows
        options are extracted from *TextFile(s)* and placed into new text
        files.

        For *categories* mode, the column specified by --categorycol is used
        to categorize data, and a summary text file is generated containing
        category names and counts; an additional text file is generated for
        each category, containing the data for that category.
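
As a rough illustration of *categories* mode, the Python sketch below builds
a summary table and one table per category. The function name, the
"Category"/"Count" summary labels, and repeating the header line in every
per-category table are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def categorize(rows, category_index=0):
    """Group data rows by one column; rows[0] is the header line.

    Returns (summary, per_category), where summary lists each category
    with its row count and per_category maps a category name to its rows.
    """
    header, data = rows[0], rows[1:]
    counts = Counter(row[category_index] for row in data)
    groups = defaultdict(list)
    for row in data:
        groups[row[category_index]].append(row)
    # Hypothetical summary labels; the real file's labels may differ.
    summary = [["Category", "Count"]]
    summary += [[name, str(count)] for name, count in counts.items()]
    per_category = {name: [header] + members for name, members in groups.items()}
    return summary, per_category


# Made-up sample data for illustration only.
sample = [["Class", "Mol_ID"], ["A", "1"], ["B", "2"], ["A", "3"]]
summary, per_category = categorize(sample)
print(summary)
print(per_category["A"])
```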

    -o, --overwrite
        Overwrite existing files.

    --outdelim *comma | tab | semicolon*
        Output text file delimiter. Possible values: *comma, tab, or
        semicolon*. Default value: *comma*.

    -q, --quote *yes | no*
        Put quotes around column values in output text file. Possible
        values: *yes or no*. Default value: *yes*.
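
The combined effect of --outdelim and --quote on one output line might look
like the Python sketch below. The function name is hypothetical, double
quotes are assumed as the quote character, and quotes embedded in values are
not escaped here.

```python
OUT_DELIMITERS = {"comma": ",", "tab": "\t", "semicolon": ";"}


def format_line(values, outdelim="comma", quote="yes"):
    """Join column values into one output line."""
    separator = OUT_DELIMITERS[outdelim]
    if quote == "yes":
        # Assumption: every value is wrapped in double quotes.
        values = ['"' + value + '"' for value in values]
    return separator.join(values)


print(format_line(["1", "aspirin"]))
print(format_line(["1", "aspirin"], outdelim="tab", quote="no"))
```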

    -r, --root *rootname*
        New file name is generated using the root: <Root>.<Ext>. Default for
        new file names: <TextFile>CategoriesSummary.<Ext>,
        <TextFile>ExtractedColumns.<Ext>, and <TextFile>ExtractedRows.<Ext>
        for *categories*, *columns*, and *rows* mode respectively, plus
        <TextFile>Category<CategoryName>.<Ext> for each category retrieved
        from each text file. The output file type determines <Ext> value:
        csv and tsv for CSV and TSV files respectively.

        This option is ignored for multiple input files.
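
The naming rules above can be summarized in a small Python sketch. The
function name is hypothetical, and two points are assumptions: <TextFile> is
taken to be the input file name without its extension, and <Ext> is taken to
be tsv when the output delimiter is tab and csv otherwise.

```python
from pathlib import Path

SUFFIX_BY_MODE = {
    "categories": "CategoriesSummary",
    "columns": "ExtractedColumns",
    "rows": "ExtractedRows",
}


def output_name(text_file, mode="columns", outdelim="comma", root=None, n_inputs=1):
    """Build the main output file name for one input file."""
    ext = "tsv" if outdelim == "tab" else "csv"   # assumption, see lead-in
    if root is not None and n_inputs == 1:
        return f"{root}.{ext}"                    # <Root>.<Ext>
    # --root is ignored for multiple input files, so use the default name.
    return f"{Path(text_file).stem}{SUFFIX_BY_MODE[mode]}.{ext}"


print(output_name("Sample1.csv", root="NewSample1"))
print(output_name("Sample1.csv", mode="rows"))
```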
104
105 --rows *"colid,value,criteria..." | "colid,value..." |
106 "colid,mincolvalue,maxcolvalue" | "rownum,rownum,..." | colid |
107 "minrownum,maxrownum"*
108 This value is --rowsmode specific. In general, it's a list of comma
109 separated column ids and associated mode specific value. Based on
110 Column ids specification, column label or number, is controlled by
111 -c, --colmode option.
112
113 First line containing column labels is always written out. And value
114 comparisons assume numerical column data.
115
116 For *rowsbycolvalue* mode, input value format contains these
117 triplets: *colid,value, criteria...*. Possible values for criteria:
118 *le, ge or eq*. Examples:
119
120 MolWt,450,le
121 MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le
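
A minimal Python sketch of *rowsbycolvalue* filtering follows. The function
names are hypothetical, columns are identified by label only, and the sketch
assumes that a row must satisfy every triplet (logical AND) and that *le*,
*ge*, and *eq* mean <=, >=, and == on numeric values.

```python
import operator

CRITERIA = {"le": operator.le, "ge": operator.ge, "eq": operator.eq}


def parse_triplets(spec):
    """Split 'colid,value,criteria,...' into (colid, value, compare) tuples."""
    parts = [part.strip() for part in spec.split(",")]
    if len(parts) % 3 != 0:
        raise ValueError("expected colid,value,criteria triplets")
    return [
        (parts[i], float(parts[i + 1]), CRITERIA[parts[i + 2]])
        for i in range(0, len(parts), 3)
    ]


def rows_by_col_value(rows, spec):
    """Keep the header line plus data rows that pass every triplet."""
    header, data = rows[0], rows[1:]
    tests = [(header.index(colid), value, compare)
             for colid, value, compare in parse_triplets(spec)]
    kept = [row for row in data
            if all(compare(float(row[col]), value) for col, value, compare in tests)]
    return [header] + kept


# Made-up sample data for illustration only.
sample = [["Mol_ID", "MolWt", "LogP"],
          ["1", "300", "2"], ["2", "500", "3"], ["3", "450", "6"]]
print(rows_by_col_value(sample, "MolWt,450,le"))
print(rows_by_col_value(sample, "MolWt,450,le,LogP,5,le"))
```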

        For *rowsbycolvaluelist* mode, input value format is:
        *colid,value...*. Examples:

            Mol_ID,20
            Mol_ID,20,1002,1115
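
In Python terms, *rowsbycolvaluelist* amounts to a membership test on one
column. The sketch below uses a hypothetical function name, identifies the
column by label, and compares values numerically, in line with the note that
value comparisons assume numerical column data.

```python
def rows_by_col_value_list(rows, spec):
    """Keep the header line plus rows whose column value is in the list."""
    colid, *values = [part.strip() for part in spec.split(",")]
    header, data = rows[0], rows[1:]
    col = header.index(colid)
    wanted = {float(value) for value in values}
    return [header] + [row for row in data if float(row[col]) in wanted]


# Made-up sample data for illustration only.
sample = [["Mol_ID", "MolWt"], ["20", "300"], ["21", "310"], ["1002", "450"]]
print(rows_by_col_value_list(sample, "Mol_ID,20,1002,1115"))
```

Values in the list that match no row, such as 1115 here, are simply skipped
in this sketch.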

        For *rowsbycolvaluerange* mode, input value format is:
        *colid,mincolvalue,maxcolvalue*. Example:

            MolWt,100,450
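
A range filter for *rowsbycolvaluerange* could be sketched in Python as
shown here. The function name is hypothetical, and treating both bounds as
inclusive is an assumption, since the text does not say whether the limits
themselves are kept.

```python
def rows_by_col_value_range(rows, spec):
    """Keep the header line plus rows whose column value lies in the range."""
    colid, low, high = [part.strip() for part in spec.split(",")]
    header, data = rows[0], rows[1:]
    col = header.index(colid)
    low, high = float(low), float(high)
    # Assumption: both ends of the range are inclusive.
    return [header] + [row for row in data if low <= float(row[col]) <= high]


# Made-up sample data for illustration only.
sample = [["Mol_ID", "MolWt"], ["1", "90"], ["2", "100"], ["3", "450"], ["4", "451"]]
print(rows_by_col_value_range(sample, "MolWt,100,450"))
```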

        For *rowbymincolvalue, rowbymaxcolvalue* modes, input value format
        is: *colid*.
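
The two single-row modes can be pictured with this Python sketch. The
function name is hypothetical, and when several rows tie for the minimum or
maximum the sketch returns the first one, which is an assumption about
behavior not described above.

```python
def row_by_extreme_col_value(rows, colid, mode="rowbymincolvalue"):
    """Return the header line plus the row with the min or max column value."""
    header, data = rows[0], rows[1:]
    col = header.index(colid)
    pick = min if mode == "rowbymincolvalue" else max
    # min()/max() return the first row encountered when values tie.
    return [header, pick(data, key=lambda row: float(row[col]))]


# Made-up sample data for illustration only.
sample = [["Mol_ID", "MolWeight"], ["1", "300"], ["2", "120.5"], ["3", "480"]]
print(row_by_extreme_col_value(sample, "MolWeight"))
print(row_by_extreme_col_value(sample, "MolWeight", "rowbymaxcolvalue"))
```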

        For *rownums* mode, input value format is: *rownum,rownum,...*.
        Default value: *2*.

        For *rownumrange* mode, input value format is:
        *minrownum,maxrownum*. Example:

            10,40
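
Row selection by number range might be sketched in Python as below. The
function name is hypothetical. Row numbers are assumed to be 1-based with
the column label line counted as row 1, which the default row number of 2
hints at but does not confirm; both ends of the range are assumed inclusive.

```python
def rows_by_num_range(rows, spec):
    """Keep the header line plus rows numbered minrownum..maxrownum."""
    low, high = (int(part) for part in spec.split(","))
    header = rows[0]
    # Row 1 is the header line; it is always written, so do not repeat it.
    picked = [row for number, row in enumerate(rows, start=1)
              if low <= number <= high and number != 1]
    return [header] + picked


# Made-up sample data for illustration only.
sample = [["Mol_ID"], ["a"], ["b"], ["c"], ["d"]]
print(rows_by_num_range(sample, "2,3"))
```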

    --rowsmode *rowsbycolvalue | rowsbycolvaluelist | rowsbycolvaluerange |
    rowbymincolvalue | rowbymaxcolvalue | rownums | rownumrange*
        Specify how to extract rows from *TextFile(s)*. Possible values:
        *rowsbycolvalue, rowsbycolvaluelist, rowsbycolvaluerange,
        rowbymincolvalue, rowbymaxcolvalue, rownums, rownumrange*. Default
        value: *rownums*.

        Use the --rows option to list the criteria used for extraction of
        rows from *TextFile(s)*.

    -w, --workingdir *dirname*
        Location of working directory. Default: current directory.

EXAMPLES
    To extract the first column from a text file and generate a new CSV text
    file NewSample1.csv, type:

        % ExtractFromTextFiles.pl -r NewSample1 -o Sample1.csv

    To extract columns Mol_ID, MolWeight, and NAME from Sample1.csv and
    generate a new text file NewSample1.tsv with no quotes, type:

        % ExtractFromTextFiles.pl -m columns -c collabel
          --columns "Mol_ID,MolWeight,NAME" --outdelim tab --quote no
          -r NewSample1 -o Sample1.csv

    To extract rows with MolWeight column values less than or equal to 450
    from Sample1.csv and generate a new text file NewSample1.csv, type:

        % ExtractFromTextFiles.pl -m rows --rowsmode rowsbycolvalue
          -c collabel --rows MolWeight,450,le -r NewSample1
          -o Sample1.csv

    To extract rows with MolWeight column values between 400 and 500 from
    Sample1.csv and generate a new text file NewSample1.csv, type:

        % ExtractFromTextFiles.pl -m rows --rowsmode rowsbycolvaluerange
          -c collabel --rows MolWeight,400,500 -r NewSample1
          -o Sample1.csv

    To extract the row containing the minimum value for column MolWeight
    from Sample1.csv and generate a new text file NewSample1.csv, type:

        % ExtractFromTextFiles.pl -m rows --rowsmode rowbymincolvalue
          -c collabel --rows MolWeight -r NewSample1
          -o Sample1.csv

AUTHOR
    Manish Sud <msud@san.rr.com>

SEE ALSO
    JoinTextFiles.pl, MergeTextFilesWithSD.pl, ModifyTextFilesFormat.pl,
    SplitTextFiles.pl

COPYRIGHT
    Copyright (C) 2015 Manish Sud. All rights reserved.

    This file is part of MayaChemTools.

    MayaChemTools is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 3 of the License, or (at
    your option) any later version.