comparison docs/scripts/txt/ExtractFromSDFiles.txt @ 0:4816e4a8ae95 draft default tip
Uploaded
author | deepakjadmin |
---|---|
date | Wed, 20 Jan 2016 09:23:18 -0500 |
-1:000000000000 | 0:4816e4a8ae95 |
---|---|
1 NAME | |
2 ExtractFromSDFiles.pl - Extract specific data from SDFile(s) | |
3 | |
4 SYNOPSIS | |
5 ExtractFromSDFiles.pl SDFile(s)... | |
6 | |
7 ExtractFromSDFiles.pl [-h, --help] [-d, --datafields "fieldlabel,..." | | |
8 "fieldlabel,value,criteria..." | "fieldlabel,value,value..."] | |
9 [--datafieldsfile filename] [--indelim comma | tab | semicolon] [-m, | |
10 --mode alldatafields | commondatafields | datafieldnotbylist | | |
11 datafields | datafieldsbyvalue | datafieldsbyregex | datafieldbylist | | |
12 datafielduniquebylist | molnames | randomcmpds | recordnum | recordnums | |
13 | recordrange | 2dcmpdrecords | 3dcmpdrecords ] [-n, --numofcmpds | |
14 number] [--outdelim comma | tab | semicolon] [--output SD | text | both] | |
15 [-o, --overwrite] [-q, --quote yes | no] [--record recnum | recnums | | |
16 startrecnum,endrecnum] [--RegexIgnoreCase yes | no] [-r, --root | |
17 rootname] [-s, --seed number] [--StrDataString yes | no] | |
18 [--StrDataStringDelimiter text] [--StrDataStringMode StrOnly | | |
19 StrAndDataFields] [--ValueComparisonMode Numeric | Alphanumeric] [-v, | |
20 --violations number] [-w, --workingdir dirname] SDFile(s)... | |
21 | |
22 DESCRIPTION | |
23 Extract specific data from *SDFile(s)* and generate appropriate SD or | |
24 CSV/TSV text file(s). The structure data from SDFile(s) is not | |
25 transferred to CSV/TSV text file(s). Multiple SDFile names are separated | |
26 by spaces. The valid file extensions are *.sdf* and *.sd*. All other | |
27 file names are ignored. All the SD files in the current directory can be | |
28 specified either by **.sdf* or the current directory name. | |
29 | |
30 OPTIONS | |
31 -h, --help | |
32 Print this help message. | |
33 | |
34 -d, --datafields *"fieldlabel,..." | "fieldlabel,value,criteria..." | | |
35 "fieldlabel,value,value,..."* | |
36 This value is mode specific. In general, it's a list of comma | |
37 separated data field labels and associated mode specific values. | |
38 | |
39 For *datafields* mode, input value format is: *fieldlabel,...*. | |
40 Examples: | |
41 | |
42 Extreg | |
43 Extreg,CompoundName,ID | |
44 | |
45 For *datafieldsbyvalue* mode, input value format contains these | |
46 triplets: *fieldlabel,value,criteria...*. Possible values for | |
47 criteria: *le, ge or eq*. The value of --ValueComparisonMode | |
48 determines whether values are compared using numerical or string | |
49 comparison operators. The default is to treat data field values as | |
50 numerical values and use numerical comparison operators. Examples: | |
51 | |
52 MolWt,450,le | |
53 MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le | |
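The triplet format above can be illustrated with a short Python sketch. It is not taken from ExtractFromSDFiles.pl, which is a Perl script. The function names `parse_triplets` and `meets` are hypothetical, and the string branch simply uses Python's own string ordering as a stand-in for the Alphanumeric comparison mode.

```python
from typing import List, Tuple


def parse_triplets(spec: str, delimiter: str = ",") -> List[Tuple[str, str, str]]:
    """Split 'fieldlabel,value,criteria,...' into (label, value, criterion) triplets."""
    parts = [part.strip() for part in spec.split(delimiter)]
    if len(parts) % 3 != 0:
        raise ValueError("expected fieldlabel,value,criteria triplets")
    return [(parts[i], parts[i + 1], parts[i + 2]) for i in range(0, len(parts), 3)]


def meets(field_value: str, value: str, criterion: str, numeric: bool = True) -> bool:
    """Check one data field value against one criterion (le, ge or eq)."""
    if numeric:
        # Numeric mode: compare as numbers.
        left, right = float(field_value), float(value)
    else:
        # Assumption: string mode compares the raw text lexically.
        left, right = field_value, value
    if criterion == "le":
        return left <= right
    if criterion == "ge":
        return left >= right
    if criterion == "eq":
        return left == right
    raise ValueError(f"unknown criterion: {criterion}")
```

The comparison mode matters for values such as 99.5 against 450: the check passes numerically, but fails lexically because the text "99.5" sorts after "450".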
54 | |
55 For *datafieldsbyregex* mode, input value format contains these | |
56 triplets: *fieldlabel,regex, criteria...*. *regex* corresponds to | |
57 any valid regular expression and is used to match the values for | |
58 specified *fieldlabel*. Possible values for criteria: *eq or ne*. | |
59 For *eq* and *ne* criteria, the data field value is matched against the | |
60 regular expression using =~ and !~ respectively. The --RegexIgnoreCase | |
61 option value determines whether letter case is ignored during | |
62 regular expression matching. Examples: | |
63 | |
64 Name,ol,eq | |
65 Name,'^pat',ne | |
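As a rough illustration of the *eq* and *ne* criteria, the following Python sketch mimics Perl's =~ and !~ with `re.search`. The function name `regex_meets` is hypothetical, and Python's regular expression syntax differs from Perl's in some details, so treat this as an approximation only.

```python
import re


def regex_meets(field_value: str, pattern: str, criterion: str,
                ignore_case: bool = True) -> bool:
    """Return True when field_value satisfies the regex criterion (eq or ne)."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.search finds a match anywhere in the value, like an unanchored Perl =~.
    matched = re.search(pattern, field_value, flags) is not None
    if criterion == "eq":
        return matched
    if criterion == "ne":
        return not matched
    raise ValueError(f"unknown criterion: {criterion}")
```

With case ignored, a name such as "Patulin" matches `^pat` and is therefore rejected by the *ne* criterion. With case sensitivity switched on, it no longer matches and is kept.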
66 | |
67 For *datafieldbylist* and *datafielduniquebylist* mode, input value | |
68 format is: *fieldlabel,value1,value2...*. This is equivalent to | |
69 *datafieldsbyvalue* mode with this input value | |
70 format: *fieldlabel,value1,eq,fieldlabel,value2,eq,...*. For | |
71 *datafielduniquebylist* mode, only unique compounds identified by | |
72 first occurrence of *value* associated with *fieldlabel* in | |
73 *SDFile(s)* are kept; any subsequent compounds are simply ignored. | |
74 | |
75 For *datafieldnotbylist* mode, input value format is: | |
76 *fieldlabel,value1,value2...*. In this mode, the script behaves | |
77 exactly opposite of *datafieldbylist* mode, and only those compounds | |
78 are extracted whose data field values don't match any specified data | |
79 field value. | |
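The three list-based modes can be compared side by side in a small Python sketch. Records are modeled as plain dictionaries and `extract_by_list` is a hypothetical name. Keeping records that lack the field entirely in *datafieldnotbylist* mode is an assumption, not something the text above states.

```python
from typing import Dict, Iterable, List


def extract_by_list(records: Iterable[Dict[str, str]], label: str,
                    values: Iterable[str],
                    mode: str = "datafieldbylist") -> List[Dict[str, str]]:
    """Filter records by whether record[label] appears in values."""
    wanted = set(values)
    seen = set()  # values already emitted, used only in unique mode
    kept = []
    for record in records:
        value = record.get(label)
        if mode == "datafieldnotbylist":
            # Keep records whose value is not on the list
            # (assumption: records lacking the field are kept too).
            if value not in wanted:
                kept.append(record)
        elif value in wanted:
            if mode == "datafielduniquebylist":
                if value in seen:
                    continue  # later occurrences are ignored
                seen.add(value)
            kept.append(record)
    return kept
```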
80 | |
81 --datafieldsfile *filename* | |
82 Filename which contains various mode specific values. This option | |
83 provides a way to specify mode specific values in a file instead of | |
84 entering them on the command line using -d, --datafields. | |
85 | |
86 For *datafields* mode, input file lines contain comma delimited | |
87 field labels: *fieldlabel,...*. Example: | |
88 | |
89 Line 1:MolId | |
90 Line 2:"Extreg",CompoundName,ID | |
91 | |
92 For *datafieldsbyvalue* mode, input file lines contain these comma | |
93 separated triplets: *fieldlabel,value, criteria*. Possible values | |
94 for criteria: *le, ge or eq*. Examples: | |
95 | |
96 Line 1:MolWt,450,le | |
97 | |
98 Line 1:"MolWt",450,le,"LogP",5,le,"SumNumNO",10,le,"SumNHOH",5,le | |
99 | |
100 Line 1:MolWt,450,le | |
101 Line 2:"LogP",5,le | |
102 Line 3:"SumNumNO",10,le | |
103 Line 4: SumNHOH,5,le | |
104 | |
111 For *datafieldbylist*, *datafielduniquebylist*, and | |
112 *datafieldnotbylist* mode, input file line format is: | |
113 | |
114 Line 1:fieldlabel; | |
115 Subsequent lines:value1,value2... | |
116 | |
117 For *datafielduniquebylist* mode, only unique compounds identified | |
118 by first occurrence of *value* associated with *fieldlabel* in | |
119 *SDFile(s)* are kept; any subsequent compounds are simply ignored. | |
120 Example: | |
121 | |
122 Line 1: MolID | |
123 Subsequent Lines: | |
124 907508 | |
125 832291,4642 | |
126 "1254","907303" | |
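A minimal Python sketch of reading the list-style file shown above: the first line is taken as the field label and every later cell as a value, with optional quotes stripped by the `csv` module. The function name `parse_list_file` is hypothetical, and a trailing semicolon after the label is not handled here.

```python
import csv
import io
from typing import List, Tuple


def parse_list_file(text: str, delimiter: str = ",") -> Tuple[str, List[str]]:
    """Return (fieldlabel, values) from list-style data fields file content."""
    rows = [row for row in csv.reader(io.StringIO(text), delimiter=delimiter) if row]
    label = rows[0][0].strip()
    # Flatten all remaining rows into one list of values, skipping empty cells.
    values = [cell.strip() for row in rows[1:] for cell in row if cell.strip()]
    return label, values
```

Passing `delimiter="\t"` or `delimiter=";"` would correspond to the tab and semicolon choices of --indelim.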
127 | |
128 --indelim *comma | tab | semicolon* | |
129 Delimiter used to specify text values for -d, --datafields and | |
130 --datafieldsfile options. Possible values: *comma, tab, or | |
131 semicolon*. Default value: *comma*. | |
132 | |
133 -m, --mode *alldatafields | commondatafields | datafields | | |
134 datafieldsbyvalue | datafieldsbyregex | datafieldbylist | | |
135 datafielduniquebylist | datafieldnotbylist | molnames | randomcmpds | | |
136 recordnum | recordnums | recordrange | 2dcmpdrecords | 3dcmpdrecords* | |
137 Specify what to extract from *SDFile(s)*. Possible values: | |
138 *alldatafields, commondatafields, datafields, datafieldsbyvalue, | |
139 datafieldsbyregex, datafieldbylist, datafielduniquebylist, | |
140 datafieldnotbylist, molnames, randomcmpds, recordnum, recordnums, | |
141 recordrange, 2dcmpdrecords, 3dcmpdrecords*. Default value: | |
142 *alldatafields*. | |
143 | |
144 For *alldatafields* and *molnames* mode, only a CSV/TSV text file is | |
145 generated; for all other modes, however, a SD file is generated by | |
146 default; you can change this behavior to generate a text file using the | |
147 *--output* option. | |
148 | |
149 For *3DCmpdRecords* mode, only those compounds with at least one | |
150 non-zero value for Z atomic coordinates are retrieved; however, | |
151 during retrieval of compounds in *2DCmpdRecords* mode, all Z atomic | |
152 coordinates must be zero. | |
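The 2D/3D rule above can be sketched in Python for a V2000 connection table, where the counts line is the fourth line of the record and each atom line carries x, y and z in three 10-character columns. The helper names are hypothetical, and V3000 records are not handled.

```python
from typing import List


def z_coordinates(molblock: str) -> List[float]:
    """Read the Z coordinate of every atom from a V2000 mol block."""
    lines = molblock.splitlines()
    atom_count = int(lines[3][0:3])  # counts line: first 3 columns hold the atom count
    # Atom lines follow the counts line; Z occupies columns 21-30.
    return [float(line[20:30]) for line in lines[4:4 + atom_count]]


def is_3d_record(molblock: str) -> bool:
    """3D: at least one atom has a non-zero Z coordinate."""
    return any(z != 0.0 for z in z_coordinates(molblock))


def is_2d_record(molblock: str) -> bool:
    """2D: all Z coordinates are zero."""
    return not is_3d_record(molblock)
```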
153 | |
154 -n, --numofcmpds *number* | |
155 Number of compounds to extract during *randomcmpds* mode. | |
156 | |
157 --outdelim *comma | tab | semicolon* | |
158 Delimiter for output CSV/TSV text file(s). Possible values: *comma, | |
159 tab, or semicolon*. Default value: *comma*. | |
160 | |
161 --output *SD | text | both* | |
162 Type of output files to generate. Possible values: *SD, text, or | |
163 both*. Default value: *SD*. For *alldatafields* and *molnames* mode, | |
164 this option is ignored and only a CSV/TSV text file is generated. | |
165 | |
166 -o, --overwrite | |
167 Overwrite existing files. | |
168 | |
169 -q, --quote *yes | no* | |
170 Put quotes around column values in output CSV/TSV text file(s). | |
171 Possible values: *yes or no*. Default value: *yes*. | |
172 | |
173 --record *recnum | recnums | startrecnum,endrecnum* | |
174 Record number, record numbers or range of records to extract during | |
175 *recordnum*, *recordnums* and *recordrange* mode. Input value format | |
176 is: <num>, <num1,num2,...> and <startnum, endnum> for *recordnum*, | |
177 *recordnums* and *recordrange* modes respectively. Default value: | |
178 none. | |
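The three --record value formats can be illustrated with a short Python sketch. The function name `records_to_extract` is hypothetical, and treating the range as inclusive at both ends is an assumption based on the wording of the examples rather than on the script itself.

```python
from typing import List


def records_to_extract(mode: str, record_spec: str) -> List[int]:
    """Turn a --record value into the list of 1-based record numbers."""
    numbers = [int(part) for part in record_spec.split(",")]
    if mode == "recordnum":
        if len(numbers) != 1:
            raise ValueError("recordnum mode expects a single record number")
        return numbers
    if mode == "recordnums":
        return numbers
    if mode == "recordrange":
        if len(numbers) != 2 or numbers[0] > numbers[1]:
            raise ValueError("recordrange mode expects startrecnum,endrecnum")
        # Assumption: both ends of the range are included.
        return list(range(numbers[0], numbers[1] + 1))
    raise ValueError(f"unsupported mode: {mode}")
```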
179 | |
180 --RegexIgnoreCase *yes or no* | |
181 Specify whether to ignore case during *datafieldsbyregex* value of | |
182 -m, --mode option. Possible values: *yes or no*. Default value: | |
183 *yes*. | |
184 | |
185 -r, --root *rootname* | |
186 New file name is generated using the root: <Root>.<Ext>. Default for | |
187 new file names: <SDFileName><mode>.<Ext>. The file type determines | |
188 <Ext> value. The sdf, csv, and tsv <Ext> values are used for SD, | |
189 comma/semicolon, and tab delimited text files respectively. This | |
190 option is ignored for multiple input files. | |
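The extension rule for --root can be sketched as follows. The function name `output_file_names` is hypothetical, and the sketch only covers the case where a root name is given, not the default naming scheme.

```python
from typing import List


def output_file_names(root: str, output: str = "SD", outdelim: str = "comma") -> List[str]:
    """Build <Root>.<Ext> names for the requested output type."""
    # Tab delimited text gets .tsv; comma and semicolon delimited text get .csv.
    text_ext = "tsv" if outdelim.lower() == "tab" else "csv"
    kind = output.lower()
    if kind == "sd":
        extensions = ["sdf"]
    elif kind == "text":
        extensions = [text_ext]
    elif kind == "both":
        extensions = ["sdf", text_ext]
    else:
        raise ValueError(f"unknown output type: {output}")
    return [f"{root}.{ext}" for ext in extensions]
```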
191 | |
192 -s, --seed *number* | |
193 Random number seed used for *randomcmpds* mode. Default:123456789. | |
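The point of a fixed seed is that the same compounds are picked on every run. The Python sketch below shows the idea only: `pick_random_records` is a hypothetical name, and Python's random number generator will not reproduce the selections made by the Perl script, even with the same seed.

```python
import random
from typing import List


def pick_random_records(total_records: int, n: int, seed: int = 123456789) -> List[int]:
    """Choose n distinct 1-based record numbers reproducibly."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    count = min(n, total_records)
    # Sort so the chosen records can be written out in file order.
    return sorted(rng.sample(range(1, total_records + 1), count))
```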
194 | |
195 --StrDataString *yes | no* | |
196 Specify whether to write out structure data string to CSV/TSV text | |
197 file(s). Possible values: *yes or no*. Default value: *no*. | |
198 | |
199 The value of StrDataStringDelimiter option is used as a delimiter to | |
200 join structure data lines into a structure data string. | |
201 | |
202 This option is ignored during generation of SD file(s). | |
203 | |
204 --StrDataStringDelimiter *text* | |
205 Delimiter for joining multiple structure data lines into a string | |
206 before writing to CSV/TSV text file(s). Possible values: *any | |
207 alphanumeric text*. Default value: *|*. | |
208 | |
209 This option is ignored during generation of SD file(s). | |
210 | |
211 --StrDataStringMode *StrOnly | StrAndDataFields* | |
212 Specify whether to include SD data fields and values along with the | |
213 structure data into structure data string before writing it out to | |
214 CSV/TSV text file(s). Possible values: *StrOnly or | |
215 StrAndDataFields*. Default value: *StrOnly*. | |
216 | |
217 The value of StrDataStringDelimiter option is used as a delimiter to | |
218 join structure data lines into a structure data string. | |
219 | |
220 This option is ignored during generation of SD file(s). | |
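The structure data string options can be pictured with a small Python sketch that joins the lines of one SD record with the delimiter. The function name `structure_data_string` is hypothetical, and treating everything up to the "M  END" line as the structure part is an assumption about where structure data ends.

```python
from typing import List


def structure_data_string(record_lines: List[str], delimiter: str = "|",
                          mode: str = "StrOnly") -> str:
    """Join the lines of one SD record into a single string."""
    if mode == "StrOnly":
        kept = []
        for line in record_lines:
            kept.append(line)
            if line.startswith("M  END"):
                break  # stop before the data fields
    elif mode == "StrAndDataFields":
        kept = list(record_lines)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return delimiter.join(kept)
```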
221 | |
222 --ValueComparisonMode *Numeric | Alphanumeric* | |
223 Specify how to compare data field values during *datafieldsbyvalue* | |
224 mode: Compare values using either numeric or string (eq, le, ge) | |
225 comparison operators. Possible values: *Numeric or Alphanumeric*. | |
226 Default value: *Numeric*. | |
227 | |
228 -v, --violations *number* | |
229 Number of criterion violations allowed for values specified during | |
230 *datafieldsbyvalue* and *datafieldsbyregex* mode. Default value: | |
231 *0*. | |
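To make the violations count concrete, the following Python sketch counts how many numeric criteria a record fails and keeps the record when that count does not exceed the allowed number. The names `count_violations` and `keep_record` are hypothetical, and counting a missing data field as a violation is an assumption.

```python
import operator
from typing import Dict, List, Tuple

COMPARATORS = {"le": operator.le, "ge": operator.ge, "eq": operator.eq}


def count_violations(record: Dict[str, str],
                     criteria: List[Tuple[str, str, str]]) -> int:
    """Count criteria (label, value, criterion) that the record does not meet."""
    violations = 0
    for label, value, criterion in criteria:
        compare = COMPARATORS[criterion]
        # Assumption: a record without the data field violates the criterion.
        if label not in record or not compare(float(record[label]), float(value)):
            violations += 1
    return violations


def keep_record(record: Dict[str, str], criteria: List[Tuple[str, str, str]],
                max_violations: int = 0) -> bool:
    """Keep the record when it has at most max_violations violations."""
    return count_violations(record, criteria) <= max_violations
```

For example, a compound with MolWt 480 fails only the MolWt criterion in a rule-of-five style filter, so it is dropped with the default of 0 violations but kept when one violation is allowed.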
232 | |
233 -w, --workingdir *dirname* | |
234 Location of working directory. Default: current directory. | |
235 | |
236 EXAMPLES | |
237 To retrieve all data fields from SD files and generate CSV text files, | |
238 type: | |
239 | |
240 % ExtractFromSDFiles.pl -o Sample.sdf | |
241 % ExtractFromSDFiles.pl -o *.sdf | |
242 | |
243 To retrieve all data fields from SD file and generate CSV text files | |
244 containing a column with structure data as a string with | as line | |
245 delimiter, type: | |
246 | |
247 % ExtractFromSDFiles.pl --StrDataString Yes -o Sample.sdf | |
248 | |
249 To retrieve Mol_ID data field from SD file and generate CSV text files | |
250 containing a column with structure data along with all data fields as a | |
251 string with | as line delimiter, type: | |
252 | |
253 % ExtractFromSDFiles.pl -m datafields -d "Mol_ID" --StrDataString Yes | |
254 --StrDataStringMode StrAndDataFields --StrDataStringDelimiter "|" | |
255 --output text -o Sample.sdf | |
256 | |
257 To retrieve common data fields which exist for all the compounds in a | |
258 SD file and generate a TSV text file NewSample.tsv, type: | |
259 | |
260 % ExtractFromSDFiles.pl -m commondatafields --outdelim tab -r NewSample | |
261 --output Text -o Sample.sdf | |
262 | |
263 To retrieve Mol_ID, MolWeight, and CompoundName data fields from a SD file | |
264 and generate a CSV text file NewSample.csv, type: | |
265 | |
266 % ExtractFromSDFiles.pl -m datafields -d "Mol_ID,MolWeight,CompoundName" | |
267 -r NewSample --output Text -o Sample.sdf | |
268 | |
269 To retrieve compounds from a SD file which meet a specific set of criteria - | |
270 MolWt <= 450, LogP <= 5 and SumNO <= 10 - and generate a | |
271 new SD file NewSample.sdf, type: | |
272 | |
273 % ExtractFromSDFiles.pl -m datafieldsbyvalue | |
274 -d "MolWt,450,le,LogP,5,le,SumNO,10,le" -r NewSample -o Sample.sdf | |
275 | |
276 To retrieve compounds from a SD file with a specific set of values for | |
277 Mol_ID and generate a new SD file NewSample.sdf, type: | |
278 | |
279 % ExtractFromSDFiles.pl -m datafieldbylist -d "Mol_ID,159,4509,4619" | |
280 -r NewSample -o Sample.sdf | |
281 | |
282 To retrieve compounds from a SD file with values for Mol_ID not on a list | |
283 of specified values and generate a new SD file NewSample.sdf, type: | |
284 | |
285 % ExtractFromSDFiles.pl -m datafieldnotbylist -d "Mol_ID,159,4509,4619" | |
286 -r NewSample -o Sample.sdf | |
287 | |
288 To retrieve 10 random compounds from a SD file and generate a new SD file | |
289 RandomSample.sdf, type: | |
290 | |
291 % ExtractFromSDFiles.pl -m randomcmpds -n 10 -r RandomSample | |
292 -o Sample.sdf | |
293 | |
294 To retrieve compound record number 10 from a SD file and generate a new | |
295 SD file NewSample.sdf, type: | |
296 | |
297 % ExtractFromSDFiles.pl -m recordnum --record 10 -r NewSample | |
298 -o Sample.sdf | |
299 | |
300 To retrieve compound record numbers 10, 20 and 30 from a SD file and | |
301 generate a new SD file NewSample.sdf, type: | |
302 | |
303 % ExtractFromSDFiles.pl -m recordnums --record 10,20,30 -r NewSample | |
304 -o Sample.sdf | |
305 | |
306 To retrieve compound records 10 through 20 from a SD file and generate a | |
307 new SD file NewSample.sdf, type: | |
308 | |
309 % ExtractFromSDFiles.pl -m recordrange --record 10,20 -r NewSample | |
310 -o Sample.sdf | |
311 | |
312 AUTHOR | |
313 Manish Sud <msud@san.rr.com> | |
314 | |
315 SEE ALSO | |
316 FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl, | |
317 MergeTextFilesWithSD.pl | |
318 | |
319 COPYRIGHT | |
320 Copyright (C) 2015 Manish Sud. All rights reserved. | |
321 | |
322 This file is part of MayaChemTools. | |
323 | |
324 MayaChemTools is free software; you can redistribute it and/or modify it | |
325 under the terms of the GNU Lesser General Public License as published by | |
326 the Free Software Foundation; either version 3 of the License, or (at | |
327 your option) any later version. | |
328 |