comparison docs/scripts/txt/ExtractFromSDFiles.txt @ 0:4816e4a8ae95 draft default tip
Uploaded
| author | deepakjadmin |
|---|---|
| date | Wed, 20 Jan 2016 09:23:18 -0500 |
| parents | |
| children |
| -1:000000000000 | 0:4816e4a8ae95 |
|---|---|
| 1 NAME | |
| 2 ExtractFromSDFiles.pl - Extract specific data from SDFile(s) | |
| 3 | |
| 4 SYNOPSIS | |
| 5 ExtractFromSDFiles.pl SDFile(s)... | |
| 6 | |
| 7 ExtractFromSDFiles.pl [-h, --help] [-d, --datafields "fieldlabel,..." | | |
| 8 "fieldlabel,value,criteria..." | "fieldlabel,value,value..."] | |
| 9 [--datafieldsfile filename] [--indelim comma | tab | semicolon] [-m, | |
| 10 --mode alldatafields | commondatafields | datafieldnotbylist | |
| 11 datafields | datafieldsbyvalue | datafieldsbyregex | datafieldbylist | | |
| 12 datafielduniquebylist | molnames | randomcmpds | recordnum | recordnums | |
| 13 | recordrange | 2dcmpdrecords | 3dcmpdrecords ] [-n, --numofcmpds | |
| 14 number] [--outdelim comma | tab | semicolon] [--output SD | text | both] | |
| 15 [-o, --overwrite] [-q, --quote yes | no] [--record recnum | | |
| 16 startrecnum,endrecnum] --RegexIgnoreCase *yes or no* [-r, --root | |
| 17 rootname] [-s, --seed number] [--StrDataString yes | no] | |
| 18 [--StrDataStringDelimiter text] [--StrDataStringMode StrOnly | | |
| 19 StrAndDataFields] [--ValueComparisonMode *Numeric | Alphanumeric*] [-v, | |
| 20 --violations number] [-w, --workingdir dirname] SDFile(s)... | |
| 21 | |
| 22 DESCRIPTION | |
| 23 Extract specific data from *SDFile(s)* and generate appropriate SD or | |
| 24 CSV/TSV text file(s). The structure data from SDFile(s) is not | |
| 25 transferred to CSV/TSV text file(s). Multiple SDFile names are separated | |
| 26 by spaces. The valid file extensions are *.sdf* and *.sd*. All other | |
| 27 file names are ignored. All the SD files in a current directory can be | |
| 28 specified either by **.sdf* or the current directory name. | |
| 29 | |
| 30 OPTIONS | |
| 31 -h, --help | |
| 32 Print this help message. | |
| 33 | |
| 34 -d, --datafields *"fieldlabel,..." | "fieldlabel,value,criteria..." | | |
| 35 "fieldlabel,value,value,..."* | |
| 36 This value is mode specific. In general, it's a list of comma | |
| 37 separated data field labels and associated mode specific values. | |
| 38 | |
| 39 For *datafields* mode, input value format is: *fieldlabel,...*. | |
| 40 Examples: | |
| 41 | |
| 42 Extreg | |
| 43 Extreg,CompoundName,ID | |
| 44 | |
| 45 For *datafieldsbyvalue* mode, input value format contains these | |
| 46 triplets: *fieldlabel,value, criteria...*. Possible values for | |
| 47 criteria: *le, ge or eq*. The value of --ValueComparisonMode | |
| 48 indicates whether values are compared using numerical or string | |
| 49 comparison operators. The default is to treat data field values as | |
| 50 numerical values and use numerical comparison operators. Examples: | |
| 51 | |
| 52 MolWt,450,le | |
| 53 MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le | |
| 54 | |
| 55 For *datafieldsbyregex* mode, input value format contains these | |
| 56 triplets: *fieldlabel,regex, criteria...*. *regex* corresponds to | |
| 57 any valid regular expression and is used to match the values for | |
| 58 specified *fieldlabel*. Possible values for criteria: *eq or ne*. | |
| 59 For *eq* and *ne* criteria, the data field value is matched against | |
| 60 the regular expression using =~ and !~ respectively. The | |
| 61 --RegexIgnoreCase option value determines whether letter case is | |
| 62 ignored during regular expression matching. Examples: | |
| 63 | |
| 64 Name,ol,eq | |
| 65 Name,'^pat',ne | |
| 66 | |
| 67 For *datafieldbylist* and *datafielduniquebylist* mode, input value | |
| 68 format is: *fieldlabel,value1,value2...*. This is equivalent to | |
| 69 *datafieldsbyvalue* mode with this input value | |
| 70 format: *fieldlabel,value1,eq,fieldlabel,value2,eq,...*. For | |
| 71 *datafielduniquebylist* mode, only unique compounds identified by | |
| 72 first occurrence of *value* associated with *fieldlabel* in | |
| 73 *SDFile(s)* are kept; any subsequent compounds are simply ignored. | |
| 74 | |
| 75 For *datafieldnotbylist* mode, input value format is: | |
| 76 *fieldlabel,value1,value2...*. In this mode, the script behaves | |
| 77 exactly opposite of *datafieldbylist* mode, and only those compounds | |
| 78 are extracted whose data field values don't match any specified data | |
| 79 field value. | |
| 80 | |
| 81 --datafieldsfile *filename* | |
| 82 Filename which contains various mode specific values. This option | |
| 83 provides a way to specify mode specific values in a file instead of | |
| 84 entering them on the command line using -d --datafields. | |
| 85 | |
| 86 For *datafields* mode, input file lines contain comma delimited | |
| 87 field labels: *fieldlabel,...*. Example: | |
| 88 | |
| 89 Line 1:MolId | |
| 90 Line 2:"Extreg",CompoundName,ID | |
| 91 | |
| 92 For *datafieldsbyvalue* mode, input file lines contain these comma | |
| 93 separated triplets: *fieldlabel,value, criteria*. Possible values | |
| 94 for criteria: *le, ge or eq*. Examples: | |
| 95 | |
| 96 Line 1:MolWt,450,le | |
| 97 | |
| 98 Line 1:"MolWt",450,le,"LogP",5,le,"SumNumNO",10,le,"SumNHOH",5,le | |
| 99 | |
| 100 Line 1:MolWt,450,le | |
| 101 Line 2:"LogP",5,le | |
| 102 Line 3:"SumNumNO",10,le | |
| 103 Line 4: SumNHOH,5,le | |
| 104 | |
| 111 For *datafieldbylist*, *datafielduniquebylist*, and | |
| 112 *datafieldnotbylist* mode, input file line format is: | |
| 113 | |
| 114 Line 1:fieldlabel; | |
| 115 Subsequent lines:value1,value2... | |
| 116 | |
| 117 For *datafielduniquebylist* mode, only unique compounds identified | |
| 118 by first occurrence of *value* associated with *fieldlabel* in | |
| 119 *SDFile(s)* are kept; any subsequent compounds are simply ignored. | |
| 120 Example: | |
| 121 | |
| 122 Line 1: MolID | |
| 123 Subsequent Lines: | |
| 124 907508 | |
| 125 832291,4642 | |
| 126 "1254","907303" | |
| 127 | |
| 128 --indelim *comma | tab | semicolon* | |
| 129 Delimiter used to specify text values for -d --datafields and | |
| 130 --datafieldsfile options. Possible values: *comma, tab, or | |
| 131 semicolon*. Default value: *comma*. | |
| 132 | |
| 133 -m, --mode *alldatafields | commondatafields | datafields | | |
| 134 datafieldsbyvalue | datafieldsbyregex | datafieldbylist | | |
| 135 datafielduniquebylist | datafieldnotbylist | molnames | randomcmpds | | |
| 136 recordnum | recordnums | recordrange | 2dcmpdrecords | 3dcmpdrecords* | |
| 137 Specify what to extract from *SDFile(s)*. Possible values: | |
| 138 *alldatafields, commondatafields, datafields, datafieldsbyvalue, | |
| 139 datafieldsbyregex, datafieldbylist, datafielduniquebylist, | |
| 140 datafieldnotbylist, molnames, randomcmpds, recordnum, recordnums, | |
| 141 recordrange, 2dcmpdrecords, 3dcmpdrecords*. Default value: | |
| 142 *alldatafields*. | |
| 143 | |
| 144 For *alldatafields* and *molnames* mode, only a CSV/TSV text file is | |
| 145 generated; for all other modes, however, a SD file is generated by | |
| 146 default - you can change the behavior to generate a text file using | |
| 147 the *--output* option. | |
| 148 | |
| 149 For *3DCmpdRecords* mode, only those compounds with at least one | |
| 150 non-zero value for Z atomic coordinates are retrieved; however, | |
| 151 during retrieval of compounds in *2DCmpdRecords* mode, all Z atomic | |
| 152 coordinates must be zero. | |
| 153 | |
| 154 -n, --numofcmpds *number* | |
| 155 Number of compounds to extract during *randomcmpds* mode. | |
| 156 | |
| 157 --outdelim *comma | tab | semicolon* | |
| 158 Delimiter for output CSV/TSV text file(s). Possible values: *comma, | |
| 159 tab, or semicolon*. Default value: *comma*. | |
| 160 | |
| 161 --output *SD | text | both* | |
| 162 Type of output files to generate. Possible values: *SD, text, or | |
| 163 both*. Default value: *SD*. For *alldatafields* and *molnames* mode, | |
| 164 this option is ignored and only a CSV/TSV text file is generated. | |
| 165 | |
| 166 -o, --overwrite | |
| 167 Overwrite existing files. | |
| 168 | |
| 169 -q, --quote *yes | no* | |
| 170 Put quote around column values in output CSV/TSV text file(s). | |
| 171 Possible values: *yes or no*. Default value: *yes*. | |
| 172 | |
| 173 --record *recnum | recnums | startrecnum,endrecnum* | |
| 174 Record number, record numbers or range of records to extract during | |
| 175 *recordnum*, *recordnums* and *recordrange* mode. Input value format | |
| 176 is: <num>, <num1,num2,...> and <startnum, endnum> for *recordnum*, | |
| 177 *recordnums* and *recordrange* modes respectively. Default value: | |
| 178 none. | |
| 179 | |
| 180 --RegexIgnoreCase *yes or no* | |
| 181 Specify whether to ignore case during *datafieldsbyregex* value of | |
| 182 -m, --mode option. Possible values: *yes or no*. Default value: | |
| 183 *yes*. | |
| 184 | |
| 185 -r, --root *rootname* | |
| 186 New file name is generated using the root: <Root>.<Ext>. Default for | |
| 187 new file names: <SDFileName><mode>.<Ext>. The file type determines | |
| 188 <Ext> value. The sdf, csv, and tsv <Ext> values are used for SD, | |
| 189 comma/semicolon, and tab delimited text files respectively. This | |
| 190 option is ignored for multiple input files. | |
| 191 | |
| 192 -s, --seed *number* | |
| 193 Random number seed used for *randomcmpds* mode. Default: 123456789. | |
| 194 | |
| 195 --StrDataString *yes | no* | |
| 196 Specify whether to write out structure data string to CSV/TSV text | |
| 197 file(s). Possible values: *yes or no*. Default value: *no*. | |
| 198 | |
| 199 The value of StrDataStringDelimiter option is used as a delimiter to | |
| 200 join structure data lines into a structure data string. | |
| 201 | |
| 202 This option is ignored during generation of SD file(s). | |
| 203 | |
| 204 --StrDataStringDelimiter *text* | |
| 205 Delimiter for joining multiple structure data lines into a string | |
| 206 before writing to CSV/TSV text file(s). Possible values: *any | |
| 207 alphanumeric text*. Default value: *|*. | |
| 208 | |
| 209 This option is ignored during generation of SD file(s). | |
| 210 | |
| 211 --StrDataStringMode *StrOnly | StrAndDataFields* | |
| 212 Specify whether to include SD data fields and values along with the | |
| 213 structure data into structure data string before writing it out to | |
| 214 CSV/TSV text file(s). Possible values: *StrOnly or | |
| 215 StrAndDataFields*. Default value: *StrOnly*. | |
| 216 | |
| 217 The value of StrDataStringDelimiter option is used as a delimiter to | |
| 218 join structure data lines into a structure data string. | |
| 219 | |
| 220 This option is ignored during generation of SD file(s). | |
| 221 | |
| 222 --ValueComparisonMode *Numeric | Alphanumeric* | |
| 223 Specify how to compare data field values during *datafieldsbyvalue* | |
| 224 mode: Compare values using either numeric or string (eq, le, ge) | |
| 225 comparison operators. Possible values: *Numeric or Alphanumeric*. | |
| 226 Default value: *Numeric*. | |
| 227 | |
| 228 -v, --violations *number* | |
| 229 Number of criterion violations allowed for values specified during | |
| 230 *datafieldsbyvalue* and *datafieldsbyregex* mode. Default value: | |
| 231 *0*. | |
| 232 | |
| 233 -w, --workingdir *dirname* | |
| 234 Location of working directory. Default: current directory. | |
| 235 | |
| 236 EXAMPLES | |
| 237 To retrieve all data fields from SD files and generate CSV text files, | |
| 238 type: | |
| 239 | |
| 240 % ExtractFromSDFiles.pl -o Sample.sdf | |
| 241 % ExtractFromSDFiles.pl -o *.sdf | |
| 242 | |
| 243 To retrieve all data fields from SD file and generate CSV text files | |
| 244 containing a column with structure data as a string with | as line | |
| 245 delimiter, type: | |
| 246 | |
| 247 % ExtractFromSDFiles.pl --StrDataString Yes -o Sample.sdf | |
| 248 | |
| 249 To retrieve Mol_ID data field from SD file and generate CSV text files | |
| 250 containing a column with structure data along with all data fields as a | |
| 251 string with | as line delimiter, type: | |
| 252 | |
| 253 % ExtractFromSDFiles.pl -m datafields -d "Mol_ID" --StrDataString Yes | |
| 254 --StrDataStringMode StrAndDataFields --StrDataStringDelimiter "|" | |
| 255 --output text -o Sample.sdf | |
| 256 | |
| 257 To retrieve common data fields which exist for all the compounds in a | |
| 258 SD file and generate a TSV text file NewSample.tsv, type: | |
| 259 | |
| 260 % ExtractFromSDFiles.pl -m commondatafields --outdelim tab -r NewSample | |
| 261 --output Text -o Sample.sdf | |
| 262 | |
| 263 To retrieve Mol_ID, MolWeight, and CompoundName data fields from a SD file | |
| 264 and generate a CSV text file NewSample.csv, type: | |
| 265 | |
| 266 % ExtractFromSDFiles.pl -m datafields | |
| 267 -d "Mol_ID,MolWeight,CompoundName" -r NewSample --output Text -o Sample.sdf | |
| 268 | |
| 269 To retrieve compounds which meet a specific set of criteria - | |
| 270 MolWt <= 450, LogP <= 5 and SumNO <= 10 - from a SD file and generate a | |
| 271 new SD file NewSample.sdf, type: | |
| 272 | |
| 273 % ExtractFromSDFiles.pl -m datafieldsbyvalue | |
| 274 -d "MolWt,450,le,LogP,5,le,SumNO,10,le" -r NewSample -o Sample.sdf | |
| 275 | |
| 276 To retrieve compounds from a SD file with a specific set of values for | |
| 277 Mol_ID and generate a new SD file NewSample.sdf, type: | |
| 278 | |
| 279 % ExtractFromSDFiles.pl -m datafieldbylist -d "Mol_ID,159,4509,4619" | |
| 280 -r NewSample -o Sample.sdf | |
| 281 | |
| 282 To retrieve compounds from a SD file with values for Mol_ID not on a list | |
| 283 of specified values and generate a new SD file NewSample.sdf, type: | |
| 284 | |
| 285 % ExtractFromSDFiles.pl -m datafieldnotbylist -d "Mol_ID,159,4509,4619" | |
| 286 -r NewSample -o Sample.sdf | |
| 287 | |
| 288 To retrieve 10 random compounds from a SD file and generate a new SD file | |
| 289 RandomSample.sdf, type: | |
| 290 | |
| 291 % ExtractFromSDFiles.pl -m randomcmpds -n 10 -r RandomSample | |
| 292 -o Sample.sdf | |
| 293 | |
| 294 To retrieve compound record number 10 from a SD file and generate a new | |
| 295 SD file NewSample.sdf, type: | |
| 296 | |
| 297 % ExtractFromSDFiles.pl -m recordnum --record 10 -r NewSample | |
| 298 -o Sample.sdf | |
| 299 | |
| 300 To retrieve compound record numbers 10, 20 and 30 from a SD file and | |
| 301 generate a new SD file NewSample.sdf, type: | |
| 302 | |
| 303 % ExtractFromSDFiles.pl -m recordnums --record 10,20,30 -r NewSample | |
| 304 -o Sample.sdf | |
| 305 | |
| 306 To retrieve compound records between 10 and 20 from SD file and generate a | |
| 307 new SD file NewSample.sdf, type: | |
| 308 | |
| 309 % ExtractFromSDFiles.pl -m recordrange --record 10,20 -r NewSample | |
| 310 -o Sample.sdf | |
| 311 | |
| 312 AUTHOR | |
| 313 Manish Sud <msud@san.rr.com> | |
| 314 | |
| 315 SEE ALSO | |
| 316 FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl, | |
| 317 MergeTextFilesWithSD.pl | |
| 318 | |
| 319 COPYRIGHT | |
| 320 Copyright (C) 2015 Manish Sud. All rights reserved. | |
| 321 | |
| 322 This file is part of MayaChemTools. | |
| 323 | |
| 324 MayaChemTools is free software; you can redistribute it and/or modify it | |
| 325 under the terms of the GNU Lesser General Public License as published by | |
| 326 the Free Software Foundation; either version 3 of the License, or (at | |
| 327 your option) any later version. | |
| 328 |
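
The *datafieldsbyvalue* triplets, the --ValueComparisonMode setting, and the --violations count described under OPTIONS interact in a way that a short sketch may make easier to follow. The Python below is only an illustration of the documented behavior, not the script's Perl implementation. The names `parse_triplets` and `record_passes` are hypothetical, records are assumed to be plain dictionaries of field label to string value, and a missing data field is assumed to count as a violation, which the documentation does not state.

```python
import operator

# Criteria names taken from the documentation above: le, ge, eq.
CRITERIA = {"le": operator.le, "ge": operator.ge, "eq": operator.eq}


def parse_triplets(spec, delim=","):
    """Split 'fieldlabel,value,criteria,...' into (label, value, criterion) tuples."""
    parts = [p.strip() for p in spec.split(delim)]
    if len(parts) % 3 != 0:
        raise ValueError("expected fieldlabel,value,criteria triplets: %r" % spec)
    triplets = []
    for i in range(0, len(parts), 3):
        label, value, criterion = parts[i:i + 3]
        if criterion not in CRITERIA:
            raise ValueError("unknown criterion: %r" % criterion)
        triplets.append((label, value, criterion))
    return triplets


def record_passes(record, triplets, numeric=True, max_violations=0):
    """Return True if the record violates at most max_violations criteria."""
    violations = 0
    for label, value, criterion in triplets:
        field = record.get(label)
        if field is None:
            # Assumption: a missing data field counts as a violation.
            violations += 1
            continue
        compare = CRITERIA[criterion]
        if numeric:
            # Numeric mode: non-numeric text raises ValueError here.
            ok = compare(float(field), float(value))
        else:
            # Alphanumeric mode: plain string comparison.
            ok = compare(field, value)
        if not ok:
            violations += 1
    return violations <= max_violations


if __name__ == "__main__":
    criteria = parse_triplets("MolWt,450,le,LogP,5,le")
    for rec in ({"MolWt": "320.4", "LogP": "2.1"},
                {"MolWt": "512.7", "LogP": "4.0"}):
        print(rec, record_passes(rec, criteria),
              record_passes(rec, criteria, max_violations=1))
```

With string comparison the ordering can differ from the numeric one (for instance "9" sorts after "10"), which is why the comparison mode matters for numeric-looking fields.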

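The three list-based modes (*datafieldbylist*, *datafielduniquebylist*, and *datafieldnotbylist*) differ only in how a record's field value is tested against the supplied list. The following Python sketch illustrates that difference as described under OPTIONS; it is not taken from the script. The function `select_by_list` and the sample records are hypothetical, and treating a record that lacks the field as "not on the list" is an assumption.

```python
def select_by_list(records, label, values, mode="datafieldbylist"):
    """Filter records (dicts of field label -> value) by a list of values."""
    wanted = set(values)
    seen = set()  # values already emitted, used only for the unique mode
    kept = []
    for record in records:
        value = record.get(label)  # None when the field is absent (assumption)
        listed = value in wanted
        if mode == "datafieldbylist":
            if listed:
                kept.append(record)
        elif mode == "datafielduniquebylist":
            # Keep only the first record seen for each listed value.
            if listed and value not in seen:
                seen.add(value)
                kept.append(record)
        elif mode == "datafieldnotbylist":
            if not listed:
                kept.append(record)
        else:
            raise ValueError("unsupported mode: %r" % mode)
    return kept


if __name__ == "__main__":
    sample = [{"Mol_ID": "159"}, {"Mol_ID": "4509"},
              {"Mol_ID": "159"}, {"Mol_ID": "7"}]
    ids = ["159", "4509", "4619"]
    for m in ("datafieldbylist", "datafielduniquebylist", "datafieldnotbylist"):
        print(m, [r["Mol_ID"] for r in select_by_list(sample, "Mol_ID", ids, m)])
```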