NAME
    ExtractFromSDFiles.pl - Extract specific data from SDFile(s)

SYNOPSIS
    ExtractFromSDFiles.pl SDFile(s)...

    ExtractFromSDFiles.pl [-h, --help] [-d, --datafields "fieldlabel,..." |
    "fieldlabel,value,criteria..." | "fieldlabel,value,value..."]
    [--datafieldsfile filename] [--indelim comma | tab | semicolon] [-m,
    --mode alldatafields | commondatafields | datafields |
    datafieldsbyvalue | datafieldsbyregex | datafieldbylist |
    datafielduniquebylist | datafieldnotbylist | molnames | randomcmpds |
    recordnum | recordnums | recordrange | 2dcmpdrecords | 3dcmpdrecords]
    [-n, --numofcmpds number] [--outdelim comma | tab | semicolon]
    [--output SD | text | both] [-o, --overwrite] [-q, --quote yes | no]
    [--record recnum | recnums | startrecnum,endrecnum]
    [--RegexIgnoreCase yes | no] [-r, --root rootname] [-s, --seed number]
    [--StrDataString yes | no] [--StrDataStringDelimiter text]
    [--StrDataStringMode StrOnly | StrAndDataFields]
    [--ValueComparisonMode Numeric | Alphanumeric] [-v, --violations
    number] [-w, --workingdir dirname] SDFile(s)...

DESCRIPTION
    Extract specific data from *SDFile(s)* and generate appropriate SD or
    CSV/TSV text file(s). The structure data from SDFile(s) is not
    transferred to CSV/TSV text file(s). Multiple SDFile names are separated
    by spaces. The valid file extensions are *.sdf* and *.sd*. All other
    file names are ignored. All the SD files in the current directory can be
    specified either by **.sdf* or the current directory name.

OPTIONS
    -h, --help
        Print this help message.

    -d, --datafields *"fieldlabel,..." | "fieldlabel,value,criteria..." |
    "fieldlabel,value,value,..."*
        This value is mode specific. In general, it's a list of comma
        separated data field labels and associated mode specific values.

        For *datafields* mode, input value format is: *fieldlabel,...*.
        Examples:

            Extreg
            Extreg,CompoundName,ID

        For *datafieldsbyvalue* mode, input value format contains these
        triplets: *fieldlabel,value,criteria...*. Possible values for
        criteria: *le, ge or eq*. The value of --ValueComparisonMode
        determines whether values are compared using numerical or string
        comparison operators. By default, data field values are treated as
        numerical values and compared using numerical comparison operators.
        Examples:

            MolWt,450,le
            MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le

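The triplet format described for *datafieldsbyvalue* mode can be illustrated with a minimal Python sketch. It is not the script's Perl implementation: the helper names `parse_triplets` and `satisfies` are hypothetical, and treating a missing data field as a failed criterion is an assumption.

```python
import operator

# Criteria codes (le, ge, eq) come from the option text; the mapping is illustrative.
COMPARISON_OPS = {"le": operator.le, "ge": operator.ge, "eq": operator.eq}


def parse_triplets(spec, delim=","):
    """Split 'label,value,criterion,...' into (label, value, criterion) tuples."""
    parts = [part.strip() for part in spec.split(delim)]
    if len(parts) % 3 != 0:
        raise ValueError("expected fieldlabel,value,criteria triplets")
    return [tuple(parts[i:i + 3]) for i in range(0, len(parts), 3)]


def satisfies(record, triplet, numeric=True):
    """Check one criterion against a record (dict of field label -> string value)."""
    label, value, criterion = triplet
    if label not in record:
        return False  # assumption: a missing field counts as a failed criterion
    left, right = record[label], value
    if numeric:
        left, right = float(left), float(right)
    return COMPARISON_OPS[criterion](left, right)


triplets = parse_triplets("MolWt,450,le,LogP,5,le")
record = {"MolWt": "312.4", "LogP": "5.6"}
results = [satisfies(record, triplet) for triplet in triplets]
```

With string comparison (`numeric=False`), a value such as "90" no longer satisfies `MolWt,450,le`, because "9" sorts after "4"; this is the practical difference between the two comparison modes.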
        For *datafieldsbyregex* mode, input value format contains these
        triplets: *fieldlabel,regex,criteria...*. *regex* corresponds to
        any valid regular expression and is used to match the values for
        specified *fieldlabel*. Possible values for criteria: *eq or ne*.
        For *eq* and *ne* criteria, the data field value is matched against
        the regular expression using =~ and !~ respectively. The
        --RegexIgnoreCase option value determines whether letter case is
        ignored during the regular expression match. Examples:

            Name,ol,eq
            Name,'^pat',ne

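The *eq* and *ne* regex criteria can be pictured with the following Python sketch. It uses Python's `re` module, whose syntax differs from Perl's in places, and the function name `regex_satisfies` is hypothetical; `re.search` stands in for Perl's unanchored =~ match.

```python
import re


def regex_satisfies(value, pattern, criterion, ignore_case=True):
    """Return True when a field value meets an 'eq' (match) or 'ne' (no match) criterion."""
    flags = re.IGNORECASE if ignore_case else 0
    matched = re.search(pattern, value, flags) is not None
    if criterion == "eq":
        return matched
    if criterion == "ne":
        return not matched
    raise ValueError("criterion must be 'eq' or 'ne'")


keeps_ethanol = regex_satisfies("Ethanol", "ol", "eq")
keeps_patulin = regex_satisfies("Patulin", "^pat", "ne")
```

Here `keeps_patulin` is false because case is ignored by default, so "^pat" matches "Patulin"; passing `ignore_case=False` would reverse that outcome.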
        For *datafieldbylist* and *datafielduniquebylist* mode, input value
        format is: *fieldlabel,value1,value2...*. This is equivalent to
        *datafieldsbyvalue* mode with this input value format:
        *fieldlabel,value1,eq,fieldlabel,value2,eq,...*. For
        *datafielduniquebylist* mode, only unique compounds identified by
        first occurrence of *value* associated with *fieldlabel* in
        *SDFile(s)* are kept; any subsequent compounds are simply ignored.

        For *datafieldnotbylist* mode, input value format is:
        *fieldlabel,value1,value2...*. In this mode, the script behaves
        exactly opposite of *datafieldbylist* mode, and only those compounds
        are extracted whose data field values don't match any specified data
        field value.

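The three list-based modes differ only in which records survive the filter. The Python sketch below models that behavior on a list of dictionaries; `filter_by_list` is a hypothetical name, and keeping records that lack the field in *datafieldnotbylist* mode is an assumption rather than documented behavior.

```python
def filter_by_list(records, label, values, mode="datafieldbylist"):
    """Keep records according to the list-based extraction modes."""
    wanted = set(values)
    seen = set()
    kept = []
    for record in records:
        value = record.get(label)
        in_list = value in wanted
        if mode == "datafieldnotbylist":
            if not in_list:
                kept.append(record)
        elif in_list:
            if mode == "datafielduniquebylist":
                if value in seen:
                    continue  # later occurrences of the same value are ignored
                seen.add(value)
            kept.append(record)
    return kept


records = [{"Mol_ID": "159"}, {"Mol_ID": "200"}, {"Mol_ID": "159"}, {"Mol_ID": "4509"}]
ids = ["159", "4509", "4619"]
by_list = [r["Mol_ID"] for r in filter_by_list(records, "Mol_ID", ids)]
unique = [r["Mol_ID"] for r in filter_by_list(records, "Mol_ID", ids, "datafielduniquebylist")]
not_by_list = [r["Mol_ID"] for r in filter_by_list(records, "Mol_ID", ids, "datafieldnotbylist")]
```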
    --datafieldsfile *filename*
        Filename which contains various mode specific values. This option
        provides a way to specify mode specific values in a file instead of
        entering them on the command line using -d, --datafields.

        For *datafields* mode, input file lines contain comma delimited
        field labels: *fieldlabel,...*. Example:

            Line 1:MolId
            Line 2:"Extreg",CompoundName,ID

        For *datafieldsbyvalue* mode, input file lines contain these comma
        separated triplets: *fieldlabel,value,criteria*. Possible values
        for criteria: *le, ge or eq*. Examples:

            Line 1:MolWt,450,le

            Line 1:"MolWt",450,le,"LogP",5,le,"SumNumNO",10,le,"SumNHOH",5,le

            Line 1:MolWt,450,le
            Line 2:"LogP",5,le
            Line 3:"SumNumNO",10,le
            Line 4:SumNHOH,5,le

        For *datafieldbylist*, *datafielduniquebylist*, and
        *datafieldnotbylist* mode, input file line format is:

            Line 1:fieldlabel;
            Subsequent lines:value1,value2...

        For *datafielduniquebylist* mode, only unique compounds identified
        by first occurrence of *value* associated with *fieldlabel* in
        *SDFile(s)* are kept; any subsequent compounds are simply ignored.
        Example:

            Line 1: MolID
            Subsequent Lines:
            907508
            832291,4642
            "1254","907303"

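The file layouts shown for --datafieldsfile mix quoted and unquoted values and spread them over one or many lines. A rough Python sketch of reading such content is given below; it relies on the standard `csv` module, reads from a string instead of a real file to stay self-contained, and the name `read_mode_values` is hypothetical. How the script itself tokenizes the file is not stated here.

```python
import csv
import io


def read_mode_values(text, delimiter=","):
    """Flatten every delimited, optionally quoted value across all lines."""
    values = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter, skipinitialspace=True)
    for row in reader:
        values.extend(cell.strip() for cell in row if cell.strip())
    return values


# List modes: the first value is the field label, the rest are the values to match.
list_file = 'MolID\n907508\n832291,4642\n"1254","907303"\n'
label, *list_values = read_mode_values(list_file)

# datafieldsbyvalue mode: triplets may sit on one line or on several.
triplet_file = 'MolWt,450,le\n"LogP",5,le\n'
triplet_values = read_mode_values(triplet_file)
```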
    --indelim *comma | tab | semicolon*
        Delimiter used to specify text values for -d, --datafields and
        --datafieldsfile options. Possible values: *comma, tab, or
        semicolon*. Default value: *comma*.

    -m, --mode *alldatafields | commondatafields | datafields |
    datafieldsbyvalue | datafieldsbyregex | datafieldbylist |
    datafielduniquebylist | datafieldnotbylist | molnames | randomcmpds |
    recordnum | recordnums | recordrange | 2dcmpdrecords | 3dcmpdrecords*
        Specify what to extract from *SDFile(s)*. Possible values:
        *alldatafields, commondatafields, datafields, datafieldsbyvalue,
        datafieldsbyregex, datafieldbylist, datafielduniquebylist,
        datafieldnotbylist, molnames, randomcmpds, recordnum, recordnums,
        recordrange, 2dcmpdrecords, 3dcmpdrecords*. Default value:
        *alldatafields*.

        For *alldatafields* and *molnames* mode, only a CSV/TSV text file is
        generated; for all other modes, however, an SD file is generated by
        default. You can change this behavior and generate text file(s)
        using the *--output* option.

        For *3DCmpdRecords* mode, only those compounds with at least one
        non-zero value for Z atomic coordinates are retrieved; in
        *2DCmpdRecords* mode, only those compounds whose Z atomic
        coordinates are all zero are retrieved.

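The 2D/3D distinction above depends only on the Z coordinates in the atom block. The Python sketch below assumes V2000 molfile layout (counts line on the fourth line, coordinates in three 10-character columns); the helper names, including `make_molblock`, which only builds test input, are hypothetical, and the script's own parsing may differ.

```python
def z_coordinates(molblock):
    """Read Z coordinates from a V2000 atom block (fixed-width columns 21-30)."""
    lines = molblock.splitlines()
    atom_count = int(lines[3][0:3])  # first field of the counts line
    return [float(line[20:30]) for line in lines[4:4 + atom_count]]


def is_3d(molblock):
    """A record counts as 3D when at least one Z coordinate is non-zero."""
    return any(z != 0.0 for z in z_coordinates(molblock))


def make_molblock(coords):
    """Build a minimal carbon-only V2000 block for illustration."""
    header = ["Sample", "  sketch", ""]
    counts = f"{len(coords):3d}{0:3d}  0  0  0  0  0  0  0  0999 V2000"
    atoms = [f"{x:10.4f}{y:10.4f}{z:10.4f} C   0  0" for x, y, z in coords]
    return "\n".join(header + [counts] + atoms + ["M  END"])


flat = make_molblock([(0.0, 0.0, 0.0), (1.2, 0.0, 0.0)])
bent = make_molblock([(0.0, 0.0, 0.0), (1.2, 0.0, 0.5)])
```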
    -n, --numofcmpds *number*
        Number of compounds to extract during *randomcmpds* mode.

    --outdelim *comma | tab | semicolon*
        Delimiter for output CSV/TSV text file(s). Possible values: *comma,
        tab, or semicolon*. Default value: *comma*.

    --output *SD | text | both*
        Type of output files to generate. Possible values: *SD, text, or
        both*. Default value: *SD*. For *alldatafields* and *molnames* mode,
        this option is ignored and only a CSV/TSV text file is generated.

    -o, --overwrite
        Overwrite existing files.

    -q, --quote *yes | no*
        Put quotes around column values in output CSV/TSV text file(s).
        Possible values: *yes or no*. Default value: *yes*.

    --record *recnum | recnums | startrecnum,endrecnum*
        Record number, record numbers or range of records to extract during
        *recordnum*, *recordnums* and *recordrange* mode. Input value format
        is: <num>, <num1,num2,...> and <startnum,endnum> for *recordnum*,
        *recordnums* and *recordrange* modes respectively. Default value:
        none.

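The three --record value formats can be summarized with a short Python sketch. The function name `parse_record_option` is hypothetical, and treating the range as inclusive at both ends is an assumption, since the option text does not say so explicitly.

```python
def parse_record_option(mode, value):
    """Turn a --record value into the list of record numbers for the given mode."""
    numbers = [int(part) for part in value.split(",")]
    if mode == "recordnum":
        if len(numbers) != 1:
            raise ValueError("recordnum expects a single record number")
        return numbers
    if mode == "recordnums":
        return numbers
    if mode == "recordrange":
        if len(numbers) != 2 or numbers[0] > numbers[1]:
            raise ValueError("recordrange expects startrecnum,endrecnum")
        start, end = numbers
        return list(range(start, end + 1))  # assumption: both ends are included
    raise ValueError(f"mode {mode!r} does not use --record")


single = parse_record_option("recordnum", "10")
several = parse_record_option("recordnums", "10,20,30")
span = parse_record_option("recordrange", "10,20")
```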
    --RegexIgnoreCase *yes | no*
        Specify whether to ignore case during *datafieldsbyregex* value of
        -m, --mode option. Possible values: *yes or no*. Default value:
        *yes*.

    -r, --root *rootname*
        New file name is generated using the root: <Root>.<Ext>. Default for
        new file names: <SDFileName><mode>.<Ext>. The file type determines
        <Ext> value. The sdf, csv, and tsv <Ext> values are used for SD,
        comma/semicolon, and tab delimited text files respectively. This
        option is ignored for multiple input files.

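The naming rule for output files can be sketched in Python as follows. The function `output_name` is hypothetical, and the exact spelling of the <mode> part in default names (for example its capitalization) is not specified in this text, so the sketch simply appends the mode string as given.

```python
import os

# Extension per text delimiter, as listed in the option text.
EXT_BY_DELIM = {"comma": "csv", "semicolon": "csv", "tab": "tsv"}


def output_name(sd_file, mode, kind, outdelim="comma", root=None):
    """Build <Root>.<Ext>, or <SDFileName><mode>.<Ext> when no root is given."""
    ext = "sdf" if kind == "SD" else EXT_BY_DELIM[outdelim]
    if root is None:
        base = os.path.splitext(os.path.basename(sd_file))[0]
        root = f"{base}{mode}"
    return f"{root}.{ext}"


default_name = output_name("Sample.sdf", "randomcmpds", "SD")
rooted_name = output_name("Sample.sdf", "commondatafields", "text", "tab", root="NewSample")
```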
    -s, --seed *number*
        Random number seed used for *randomcmpds* mode. Default: 123456789.

    --StrDataString *yes | no*
        Specify whether to write out structure data string to CSV/TSV text
        file(s). Possible values: *yes or no*. Default value: *no*.

        The value of StrDataStringDelimiter option is used as a delimiter to
        join structure data lines into a structure data string.

        This option is ignored during generation of SD file(s).

    --StrDataStringDelimiter *text*
        Delimiter for joining multiple structure data lines into a string
        before writing to CSV/TSV text file(s). Possible values: *any
        alphanumeric text*. Default value: *|*.

        This option is ignored during generation of SD file(s).

    --StrDataStringMode *StrOnly | StrAndDataFields*
        Specify whether to include SD data fields and values along with the
        structure data into structure data string before writing it out to
        CSV/TSV text file(s). Possible values: *StrOnly or
        StrAndDataFields*. Default value: *StrOnly*.

        The value of StrDataStringDelimiter option is used as a delimiter to
        join structure data lines into a structure data string.

        This option is ignored during generation of SD file(s).

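Joining structure data lines into a single string, as the three StrDataString options describe, might look like the Python sketch below. The name `structure_data_string` is hypothetical, and taking "structure only" to mean every line up to and including "M  END" is an assumption about where the structure block ends.

```python
def structure_data_string(record_text, delimiter="|", mode="StrOnly"):
    """Join the lines of one SD record into a single delimited string."""
    lines = record_text.splitlines()
    if mode == "StrOnly":
        # Assumption: the structure block ends at the first "M  END" line.
        for index, line in enumerate(lines):
            if line.startswith("M  END"):
                lines = lines[:index + 1]
                break
    return delimiter.join(lines)


record = "\n".join([
    "Ethanol",
    "  sketch",
    "",
    "  0  0  0  0  0  0  0  0  0  0999 V2000",
    "M  END",
    ">  <MolWt>",
    "46.07",
])
structure_only = structure_data_string(record)
with_fields = structure_data_string(record, mode="StrAndDataFields")
```

Because the joined string lands in one CSV/TSV column, a delimiter that never occurs in the structure data is the safer choice.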
    --ValueComparisonMode *Numeric | Alphanumeric*
        Specify how to compare data field values during *datafieldsbyvalue*
        mode: compare values using either numeric or string (eq, le, ge)
        comparison operators. Possible values: *Numeric or Alphanumeric*.
        Default value: *Numeric*.

    -v, --violations *number*
        Number of criterion violations allowed for values specified during
        *datafieldsbyvalue* and *datafieldsbyregex* mode. Default value:
        *0*.

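The effect of --violations can be pictured as a tolerance on failed criteria. In the Python sketch below, `passes` is a hypothetical helper that takes one boolean per criterion (true when the criterion is met) and keeps a compound when the number of failures does not exceed the allowed count.

```python
def passes(criteria_results, violations_allowed=0):
    """Keep a compound when its failed criteria do not exceed the allowed count."""
    violations = sum(1 for satisfied in criteria_results if not satisfied)
    return violations <= violations_allowed


# Example: four criteria were evaluated and one of them failed.
outcome = [True, True, False, True]
strict = passes(outcome)
lenient = passes(outcome, violations_allowed=1)
```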
    -w, --workingdir *dirname*
        Location of working directory. Default: current directory.

EXAMPLES
    To retrieve all data fields from SD files and generate CSV text files,
    type:

        % ExtractFromSDFiles.pl -o Sample.sdf
        % ExtractFromSDFiles.pl -o *.sdf

    To retrieve all data fields from an SD file and generate CSV text files
    containing a column with structure data as a string with | as line
    delimiter, type:

        % ExtractFromSDFiles.pl --StrDataString Yes -o Sample.sdf

    To retrieve the Mol_ID data field from an SD file and generate CSV text
    files containing a column with structure data along with all data
    fields as a string with | as line delimiter, type:

        % ExtractFromSDFiles.pl -m datafields -d "Mol_ID" --StrDataString Yes
          --StrDataStringMode StrAndDataFields --StrDataStringDelimiter "|"
          --output text -o Sample.sdf

    To retrieve common data fields which exist for all the compounds in an
    SD file and generate a TSV text file NewSample.tsv, type:

        % ExtractFromSDFiles.pl -m commondatafields --outdelim tab -r NewSample
          --output Text -o Sample.sdf

    To retrieve Mol_ID, MolWeight, and CompoundName data fields from an SD
    file and generate a CSV text file NewSample.csv, type:

        % ExtractFromSDFiles.pl -m datafields -d "Mol_ID,MolWeight,CompoundName"
          -r NewSample --output Text -o Sample.sdf

    To retrieve compounds which meet a specific set of criteria - MolWt <=
    450, LogP <= 5 and SumNO <= 10 - from an SD file and generate a new SD
    file NewSample.sdf, type:

        % ExtractFromSDFiles.pl -m datafieldsbyvalue
          -d "MolWt,450,le,LogP,5,le,SumNO,10,le" -r NewSample -o Sample.sdf

    To retrieve compounds from an SD file with a specific set of values for
    Mol_ID and generate a new SD file NewSample.sdf, type:

        % ExtractFromSDFiles.pl -m datafieldbylist -d "Mol_ID,159,4509,4619"
          -r NewSample -o Sample.sdf

    To retrieve compounds from an SD file with values for Mol_ID not on a
    list of specified values and generate a new SD file NewSample.sdf, type:

        % ExtractFromSDFiles.pl -m datafieldnotbylist -d "Mol_ID,159,4509,4619"
          -r NewSample -o Sample.sdf

    To retrieve 10 random compounds from an SD file and generate a new SD
    file RandomSample.sdf, type:

        % ExtractFromSDFiles.pl -m randomcmpds -n 10 -r RandomSample
          -o Sample.sdf

    To retrieve compound record number 10 from an SD file and generate a new
    SD file NewSample.sdf, type:

        % ExtractFromSDFiles.pl -m recordnum --record 10 -r NewSample
          -o Sample.sdf

    To retrieve compound record numbers 10, 20 and 30 from an SD file and
    generate a new SD file NewSample.sdf, type:

        % ExtractFromSDFiles.pl -m recordnums --record 10,20,30 -r NewSample
          -o Sample.sdf

    To retrieve compound records 10 through 20 from an SD file and generate
    a new SD file NewSample.sdf, type:

        % ExtractFromSDFiles.pl -m recordrange --record 10,20 -r NewSample
          -o Sample.sdf

AUTHOR
    Manish Sud <msud@san.rr.com>

SEE ALSO
    FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl,
    MergeTextFilesWithSD.pl

COPYRIGHT
    Copyright (C) 2015 Manish Sud. All rights reserved.

    This file is part of MayaChemTools.

    MayaChemTools is free software; you can redistribute it and/or modify it
    under the terms of the GNU Lesser General Public License as published by
    the Free Software Foundation; either version 3 of the License, or (at
    your option) any later version.
