comparison docs/scripts/txt/ExtractFromSDFiles.txt @ 0:4816e4a8ae95 draft default tip

Uploaded
author deepakjadmin
date Wed, 20 Jan 2016 09:23:18 -0500
-1:000000000000 0:4816e4a8ae95
1 NAME
2 ExtractFromSDFiles.pl - Extract specific data from SDFile(s)
3
4 SYNOPSIS
5 ExtractFromSDFiles.pl SDFile(s)...
6
7 ExtractFromSDFiles.pl [-h, --help] [-d, --datafields "fieldlabel,..." |
8 "fieldlabel,value,criteria..." | "fieldlabel,value,value..."]
9 [--datafieldsfile filename] [--indelim comma | tab | semicolon] [-m,
10 --mode alldatafields | commondatafields | datafieldnotbylist |
11 datafields | datafieldsbyvalue | datafieldsbyregex | datafieldbylist |
12 datafielduniquebylist | molnames | randomcmpds | recordnum | recordnums
13 | recordrange | 2dcmpdrecords | 3dcmpdrecords ] [-n, --numofcmpds
14 number] [--outdelim comma | tab | semicolon] [--output SD | text | both]
15 [-o, --overwrite] [-q, --quote yes | no] [--record recnum | recnums |
16 startrecnum,endrecnum] [--RegexIgnoreCase *yes or no*] [-r, --root
17 rootname] [-s, --seed number] [--StrDataString yes | no]
18 [--StrDataStringDelimiter text] [--StrDataStringMode StrOnly |
19 StrAndDataFields] [--ValueComparisonMode *Numeric | Alphanumeric*] [-v,
20 --violations number] [-w, --workingdir dirname] SDFile(s)...
21
22 DESCRIPTION
23 Extract specific data from *SDFile(s)* and generate appropriate SD or
24 CSV/TSV text file(s). The structure data from SDFile(s) is not
25 transferred to CSV/TSV text file(s). Multiple SDFile names are separated
26 by spaces. The valid file extensions are *.sdf* and *.sd*. All other
27 file names are ignored. All the SD files in the current directory can be
28 specified either by **.sdf* or the current directory name.
29
30 OPTIONS
31 -h, --help
32 Print this help message.
33
34 -d, --datafields *"fieldlabel,..." | "fieldlabel,value,criteria..." |
35 "fieldlabel,value,value,..."*
36 This value is mode specific. In general, it's a list of comma
37 separated data field labels and associated mode specific values.
38
39 For *datafields* mode, input value format is: *fieldlabel,...*.
40 Examples:
41
42 Extreg
43 Extreg,CompoundName,ID
44
45 For *datafieldsbyvalue* mode, input value format contains these
46 triplets: *fieldlabel,value, criteria...*. Possible values for
47 criteria: *le, ge or eq*. The value of --ValueComparisonMode
48 indicates whether values are compared using numerical or string
49 comparison operators. The default is to treat data field values as
50 numerical values and use numerical comparison operators. Examples:
51
52 MolWt,450,le
53 MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le
54
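The triplet format above can be illustrated with a short Python sketch. It is not the script's Perl implementation: the names parse_triplets and meets_criterion are hypothetical, and a record is assumed to be a plain dictionary mapping field labels to string values.

```python
import operator

# Map the documented criteria names to comparison functions.
CRITERIA = {"le": operator.le, "ge": operator.ge, "eq": operator.eq}

def parse_triplets(spec, delim=","):
    """Split 'label,value,criterion,...' into (label, value, criterion) tuples."""
    parts = [p.strip() for p in spec.split(delim)]
    if len(parts) % 3 != 0:
        raise ValueError("expected fieldlabel,value,criteria triplets")
    triplets = [tuple(parts[i:i + 3]) for i in range(0, len(parts), 3)]
    for _, _, criterion in triplets:
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
    return triplets

def meets_criterion(field_value, value, criterion, numeric=True):
    """Compare a record's field value to the specified value.

    numeric=True mimics the Numeric comparison mode; numeric=False
    compares the values as strings (Alphanumeric mode).
    """
    if numeric:
        return CRITERIA[criterion](float(field_value), float(value))
    return CRITERIA[criterion](str(field_value), str(value))

triplets = parse_triplets("MolWt,450,le,LogP,5,le")
record = {"MolWt": "320.4", "LogP": "5.7"}  # hypothetical record
results = [meets_criterion(record[label], value, crit)
           for label, value, crit in triplets]
print(results)  # [True, False]
```

Note that string comparison orders "1000" before "450", which is why the comparison mode matters for numeric-looking data.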
55 For *datafieldsbyregex* mode, input value format contains these
56 triplets: *fieldlabel,regex, criteria...*. *regex* corresponds to
57 any valid regular expression and is used to match the values for
58 specified *fieldlabel*. Possible values for criteria: *eq or ne*.
59 For *eq* and *ne* criteria, the data field value is matched against the
60 regular expression using =~ and !~ respectively. The --RegexIgnoreCase
61 option value determines whether letter case is ignored during the
62 regular expression match. Examples:
63
64 Name,ol,eq
65 Name,'^pat',ne
66
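The eq/ne regex criteria can be sketched in Python as follows. This is an illustration only: the function name regex_criterion_met is hypothetical, and Python's re module stands in for Perl's =~ and !~ operators, whose regular expression syntax differs from Python's in some advanced features.

```python
import re

def regex_criterion_met(field_value, pattern, criterion, ignore_case=True):
    """Return True when the field value satisfies the regex criterion.

    'eq' corresponds to a successful match (Perl =~) and 'ne' to a
    failed match (Perl !~). ignore_case mirrors --RegexIgnoreCase.
    """
    flags = re.IGNORECASE if ignore_case else 0
    matched = re.search(pattern, field_value, flags) is not None
    if criterion == "eq":
        return matched
    if criterion == "ne":
        return not matched
    raise ValueError(f"unknown criterion: {criterion}")

# Hypothetical Name values checked against the two example triplets.
print(regex_criterion_met("Ethanol", "ol", "eq"))     # True
print(regex_criterion_met("Patulin", "^pat", "ne"))   # False
print(regex_criterion_met("Patulin", "^pat", "ne", ignore_case=False))  # True
```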
67 For *datafieldbylist* and *datafielduniquebylist* mode, input value
68 format is: *fieldlabel,value1,value2...*. This is equivalent to
69 *datafieldsbyvalue* mode with this input value
70 format:*fieldlabel,value1,eq,fieldlabel,value2,eq,...*. For
71 *datafielduniquebylist* mode, only unique compounds identified by
72 first occurrence of *value* associated with *fieldlabel* in
73 *SDFile(s)* are kept; any subsequent compounds are simply ignored.
74
75 For *datafieldnotbylist* mode, input value format is:
76 *fieldlabel,value1,value2...*. In this mode, the script behaves
77 exactly opposite of *datafieldbylist* mode, and only those compounds
78 are extracted whose data field values don't match any specified data
79 field value.
80
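The three list-based modes differ only in which records they keep, as the following Python sketch suggests. The function extract_by_list and the dictionary records are hypothetical, and skipping records that lack the field is an assumption of this sketch; the text above does not say how the script treats them.

```python
def extract_by_list(records, label, values, mode="datafieldbylist"):
    """Keep records according to one of the three list-based modes."""
    wanted = set(values)
    seen = set()   # values already kept in datafielduniquebylist mode
    kept = []
    for record in records:
        if label not in record:
            continue  # assumption: records without the field are skipped
        value = record[label]
        in_list = value in wanted
        if mode == "datafieldbylist":
            if in_list:
                kept.append(record)
        elif mode == "datafielduniquebylist":
            # Only the first occurrence of each listed value is kept.
            if in_list and value not in seen:
                seen.add(value)
                kept.append(record)
        elif mode == "datafieldnotbylist":
            if not in_list:
                kept.append(record)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return kept

records = [{"Mol_ID": "159"}, {"Mol_ID": "4509"},
           {"Mol_ID": "159"}, {"Mol_ID": "7000"}]
values = ["159", "4509", "4619"]
for mode in ("datafieldbylist", "datafielduniquebylist", "datafieldnotbylist"):
    print(mode, len(extract_by_list(records, "Mol_ID", values, mode)))
```

With the four sample records, the modes keep 3, 2, and 1 records respectively.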
81 --datafieldsfile *filename*
82 Filename which contains various mode specific values. This option
83 provides a way to specify mode specific values in a file instead of
84 entering them on the command line using -d --datafields.
85
86 For *datafields* mode, input file lines contain comma delimited
87 field labels: *fieldlabel,...*. Example:
88
89 Line 1:MolId
90 Line 2:"Extreg",CompoundName,ID
91
92 For *datafieldsbyvalue* mode, input file lines contain these comma
93 separated triplets: *fieldlabel,value, criteria*. Possible values
94 for criteria: *le, ge or eq*. Examples:
95
96 Line 1:MolWt,450,le
97
98 Line 1:"MolWt",450,le,"LogP",5,le,"SumNumNO",10,le,"SumNHOH",5,le
99
100 Line 1:MolWt,450,le
101 Line 2:"LogP",5,le
102 Line 3:"SumNumNO",10,le
103 Line 4: SumNHOH,5,le
104
111 For *datafieldbylist*, *datafielduniquebylist*, and
112 *datafieldnotbylist* mode, input file line format is:
113
114 Line 1:fieldlabel;
115 Subsequent lines:value1,value2...
116
117 For *datafielduniquebylist* mode, only unique compounds identified
118 by first occurrence of *value* associated with *fieldlabel* in
119 *SDFile(s)* are kept; any subsequent compounds are simply ignored.
120 Example:
121
122 Line 1: MolID
123 Subsequent Lines:
124 907508
125 832291,4642
126 "1254","907303"
127
128 --indelim *comma | tab | semicolon*
129 Delimiter used to specify text values for -d --datafields and
130 --datafieldsfile options. Possible values: *comma, tab, or
131 semicolon*. Default value: *comma*.
132
133 -m, --mode *alldatafields | commondatafields | datafields |
134 datafieldsbyvalue | datafieldsbyregex | datafieldbylist |
135 datafielduniquebylist | datafieldnotbylist | molnames | randomcmpds |
136 recordnum | recordnums | recordrange | 2dcmpdrecords | 3dcmpdrecords*
137 Specify what to extract from *SDFile(s)*. Possible values:
138 *alldatafields, commondatafields, datafields, datafieldsbyvalue,
139 datafieldsbyregex, datafieldbylist, datafielduniquebylist,
140 datafieldnotbylist, molnames, randomcmpds, recordnum, recordnums,
141 recordrange, 2dcmpdrecords, 3dcmpdrecords*. Default value:
142 *alldatafields*.
143
144 For *alldatafields* and *molnames* mode, only a CSV/TSV text file is
145 generated; for all other modes, however, a SD file is generated by
146 default - you can change the behavior to generate a text file using the
147 *--output* option.
148
149 For *3DCmpdRecords* mode, only those compounds with at least one
150 non-zero value for Z atomic coordinates are retrieved; however,
151 during retrieval of compounds in *2DCmpdRecords* mode, all Z atomic
152 coordinates must be zero.
153
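The 2D/3D distinction can be sketched in Python by reading the Z column of the atom block. The sketch assumes V2000 connection tables (atom count in the first three columns of the fourth line, Z coordinate in columns 21-30 of each atom line); the helper names are hypothetical and this is not the script's own code.

```python
def z_coordinates(molblock):
    """Return the Z coordinate of every atom in a V2000 mol block."""
    lines = molblock.splitlines()
    num_atoms = int(lines[3][0:3])          # counts line: atom count
    atom_lines = lines[4:4 + num_atoms]
    return [float(line[20:30]) for line in atom_lines]

def is_3d(molblock):
    """3D: at least one non-zero Z coordinate; 2D: all Z are zero."""
    return any(z != 0.0 for z in z_coordinates(molblock))

def make_molblock(atoms):
    """Build a minimal mol block for the demo from (x, y, z, symbol) tuples."""
    header = ["example", "  sketch", ""]
    counts = f"{len(atoms):3d}{0:3d}  0  0  0  0  0  0  0  0999 V2000"
    atom_lines = [f"{x:10.4f}{y:10.4f}{z:10.4f} {sym:<3} 0  0"
                  for x, y, z, sym in atoms]
    return "\n".join(header + [counts] + atom_lines + ["M  END"])

flat = make_molblock([(0.0, 0.0, 0.0, "C"), (1.5, 0.0, 0.0, "O")])
bent = make_molblock([(0.0, 0.0, 0.0, "C"), (1.2, 0.0, 0.5, "O")])
print(is_3d(flat), is_3d(bent))  # False True
```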
154 -n, --numofcmpds *number*
155 Number of compounds to extract during *randomcmpds* mode.
156
157 --outdelim *comma | tab | semicolon*
158 Delimiter for output CSV/TSV text file(s). Possible values: *comma,
159 tab, or semicolon*. Default value: *comma*.
160
161 --output *SD | text | both*
162 Type of output files to generate. Possible values: *SD, text, or
163 both*. Default value: *SD*. For *alldatafields* and *molnames* mode,
164 this option is ignored and only a CSV/TSV text file is generated.
165
166 -o, --overwrite
167 Overwrite existing files.
168
169 -q, --quote *yes | no*
170 Put quotes around column values in output CSV/TSV text file(s).
171 Possible values: *yes or no*. Default value: *yes*.
172
173 --record *recnum | recnums | startrecnum,endrecnum*
174 Record number, record numbers or range of records to extract during
175 *recordnum*, *recordnums* and *recordrange* mode. Input value format
176 is: <num>, <num1,num2,...> and <startnum, endnum> for *recordnum*,
177 *recordnums* and *recordrange* modes respectively. Default value:
178 none.
179
180 --RegexIgnoreCase *yes or no*
181 Specify whether to ignore case during *datafieldsbyregex* value of
182 -m, --mode option. Possible values: *yes or no*. Default value:
183 *yes*.
184
185 -r, --root *rootname*
186 New file name is generated using the root: <Root>.<Ext>. Default for
187 new file names: <SDFileName><mode>.<Ext>. The file type determines
188 <Ext> value. The sdf, csv, and tsv <Ext> values are used for SD,
189 comma/semicolon, and tab delimited text files respectively. This
190 option is ignored for multiple input files.
191
192 -s, --seed *number*
193 Random number seed used for *randomcmpds* mode. Default: 123456789.
194
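A fixed seed makes the random selection repeatable, which the following Python sketch illustrates. It assumes records are simply numbered from 1; the function pick_random_records is hypothetical, and because Python's random module is not the Perl generator the script uses, the record numbers chosen here will not match the script's output for the same seed.

```python
import random

def pick_random_records(num_records, num_to_pick, seed=123456789):
    """Pick record numbers (1-based) reproducibly for a given seed."""
    rng = random.Random(seed)   # private generator, leaves global state alone
    picked = rng.sample(range(1, num_records + 1), num_to_pick)
    return sorted(picked)

first = pick_random_records(1000, 10)
second = pick_random_records(1000, 10)
other = pick_random_records(1000, 10, seed=42)
print(first == second)  # True: same seed, same selection
print(len(other))       # 10
```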
195 --StrDataString *yes | no*
196 Specify whether to write out structure data string to CSV/TSV text
197 file(s). Possible values: *yes or no*. Default value: *no*.
198
199 The value of StrDataStringDelimiter option is used as a delimiter to
200 join structure data lines into a structure data string.
201
202 This option is ignored during generation of SD file(s).
203
204 --StrDataStringDelimiter *text*
205 Delimiter for joining multiple structure data lines into a string
206 before writing to CSV/TSV text file(s). Possible values: *any
207 alphanumeric text*. Default value: *|*.
208
209 This option is ignored during generation of SD file(s).
210
211 --StrDataStringMode *StrOnly | StrAndDataFields*
212 Specify whether to include SD data fields and values along with the
213 structure data into structure data string before writing it out to
214 CSV/TSV text file(s). Possible values: *StrOnly or
215 StrAndDataFields*. Default value: *StrOnly*.
216
217 The value of StrDataStringDelimiter option is used as a delimiter to
218 join structure data lines into a structure data string.
219
220 This option is ignored during generation of SD file(s).
221
222 --ValueComparisonMode *Numeric | Alphanumeric*
223 Specify how to compare data field values during *datafieldsbyvalue*
224 mode: Compare values using either numeric or string (eq, le, ge)
225 comparison operators. Possible values: *Numeric or Alphanumeric*.
226 Default value: *Numeric*.
227
228 -v, --violations *number*
229 Number of criterion violations allowed for values specified during
230 *datafieldsbyvalue* and *datafieldsbyregex* mode. Default value:
231 *0*.
232
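One way to read the violations setting is as a tolerance on the number of failed criteria, sketched below in Python for numeric *datafieldsbyvalue* triplets. The function passes_with_violations and the sample record are hypothetical; the sketch assumes a record is kept when its count of failed criteria does not exceed the allowed number.

```python
import operator

CRITERIA = {"le": operator.le, "ge": operator.ge, "eq": operator.eq}

def passes_with_violations(record, triplets, violations=0):
    """Keep a record when at most `violations` criteria fail."""
    failed = sum(
        1 for label, value, criterion in triplets
        if not CRITERIA[criterion](float(record[label]), float(value))
    )
    return failed <= violations

triplets = [("MolWt", "450", "le"), ("LogP", "5", "le")]
record = {"MolWt": "480.2", "LogP": "4.2"}   # fails only the MolWt criterion

print(passes_with_violations(record, triplets))                # False
print(passes_with_violations(record, triplets, violations=1))  # True
```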
233 -w, --workingdir *dirname*
234 Location of working directory. Default: current directory.
235
236 EXAMPLES
237 To retrieve all data fields from SD files and generate CSV text files,
238 type:
239
240 % ExtractFromSDFiles.pl -o Sample.sdf
241 % ExtractFromSDFiles.pl -o *.sdf
242
243 To retrieve all data fields from SD file and generate CSV text files
244 containing a column with structure data as a string with | as line
245 delimiter, type:
246
247 % ExtractFromSDFiles.pl --StrDataString Yes -o Sample.sdf
248
249 To retrieve Mol_ID data field from SD file and generate CSV text files
250 containing a column with structure data along with all data fields as a
251 string with | as line delimiter, type:
252
253 % ExtractFromSDFiles.pl -m datafields -d "Mol_ID" --StrDataString Yes
254 --StrDataStringMode StrAndDataFields --StrDataStringDelimiter "|"
255 --output text -o Sample.sdf
256
257 To retrieve common data fields which exist for all the compounds in a
258 SD file and generate a TSV text file NewSample.tsv, type:
259
260 % ExtractFromSDFiles.pl -m commondatafields --outdelim tab -r NewSample
261 --output Text -o Sample.sdf
262
263 To retrieve Mol_ID, MolWeight, and CompoundName data fields from a SD file
264 and generate a CSV text file NewSample.csv, type:
265
266 % ExtractFromSDFiles.pl -m datafields -d
267 "Mol_ID,MolWeight,CompoundName" -r NewSample --output Text -o Sample.sdf
268
269 To retrieve compounds which meet a specific set of criteria -
270 MolWt <= 450, LogP <= 5 and SumNO <= 10 - from a SD file and generate a
271 new SD file NewSample.sdf, type:
272
273 % ExtractFromSDFiles.pl -m datafieldsbyvalue -d
274 "MolWt,450,le,LogP,5,le,SumNO,10,le" -r NewSample -o Sample.sdf
275
276 To retrieve compounds from a SD file with a specific set of values for
277 Mol_ID and generate a new SD file NewSample.sdf, type:
278
279 % ExtractFromSDFiles.pl -m datafieldbylist -d "Mol_ID,159,4509,4619"
280 -r NewSample -o Sample.sdf
281
282 To retrieve compounds from a SD file with values for Mol_ID not on a list
283 of specified values and generate a new SD file NewSample.sdf, type:
284
285 % ExtractFromSDFiles.pl -m datafieldnotbylist -d "Mol_ID,159,4509,4619"
286 -r NewSample -o Sample.sdf
287
288 To retrieve 10 random compounds from a SD file and generate a new SD file
289 RandomSample.sdf, type:
290
291 % ExtractFromSDFiles.pl -m randomcmpds -n 10 -r RandomSample
292 -o Sample.sdf
293
294 To retrieve compound record number 10 from a SD file and generate a new
295 SD file NewSample.sdf, type:
296
297 % ExtractFromSDFiles.pl -m recordnum --record 10 -r NewSample
298 -o Sample.sdf
299
300 To retrieve compound record numbers 10, 20 and 30 from a SD file and
301 generate a new SD file NewSample.sdf, type:
302
303 % ExtractFromSDFiles.pl -m recordnums --record 10,20,30 -r NewSample
304 -o Sample.sdf
305
306 To retrieve compound records from 10 to 20 from a SD file and generate a
307 new SD file NewSample.sdf, type:
308
309 % ExtractFromSDFiles.pl -m recordrange --record 10,20 -r NewSample
310 -o Sample.sdf
311
312 AUTHOR
313 Manish Sud <msud@san.rr.com>
314
315 SEE ALSO
316 FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl,
317 MergeTextFilesWithSD.pl
318
319 COPYRIGHT
320 Copyright (C) 2015 Manish Sud. All rights reserved.
321
322 This file is part of MayaChemTools.
323
324 MayaChemTools is free software; you can redistribute it and/or modify it
325 under the terms of the GNU Lesser General Public License as published by
326 the Free Software Foundation; either version 3 of the License, or (at
327 your option) any later version.
328