diff docs/scripts/txt/ExtractFromSDFiles.txt @ 0:4816e4a8ae95 draft default tip
Uploaded
| author | deepakjadmin |
|---|---|
| date | Wed, 20 Jan 2016 09:23:18 -0500 |
| parents | |
| children | |
--- /dev/null Thu Jan 01 00:00:00 1970 +0000
+++ b/docs/scripts/txt/ExtractFromSDFiles.txt Wed Jan 20 09:23:18 2016 -0500
@@ -0,0 +1,328 @@
+NAME
+    ExtractFromSDFiles.pl - Extract specific data from SDFile(s)
+
+SYNOPSIS
+    ExtractFromSDFiles.pl SDFile(s)...
+
+    ExtractFromSDFiles.pl [-h, --help] [-d, --datafields "fieldlabel,..." |
+    "fieldlabel,value,criteria..." | "fieldlabel,value,value..."]
+    [--datafieldsfile filename] [--indelim comma | tab | semicolon] [-m,
+    --mode alldatafields | commondatafields | datafieldnotbylist |
+    datafields | datafieldsbyvalue | datafieldsbyregex | datafieldbylist |
+    datafielduniquebylist | molnames | randomcmpds | recordnum | recordnums
+    | recordrange | 2dcmpdrecords | 3dcmpdrecords ] [-n, --numofcmpds
+    number] [--outdelim comma | tab | semicolon] [--output SD | text | both]
+    [-o, --overwrite] [-q, --quote yes | no] [--record recnum |
+    startrecnum,endrecnum] [--RegexIgnoreCase yes | no] [-r, --root
+    rootname] [-s, --seed number] [--StrDataString yes | no]
+    [--StrDataStringDelimiter text] [--StrDataStringMode StrOnly |
+    StrAndDataFields] [--ValueComparisonMode Numeric | Alphanumeric] [-v,
+    --violations number] [-w, --workingdir dirname] SDFile(s)...
+
+DESCRIPTION
+    Extract specific data from *SDFile(s)* and generate appropriate SD or
+    CSV/TSV text file(s). The structure data from SDFile(s) is not
+    transferred to CSV/TSV text file(s). Multiple SDFile names are separated
+    by spaces. The valid file extensions are *.sdf* and *.sd*. All other
+    file names are ignored. All the SD files in the current directory can be
+    specified either by **.sdf* or the current directory name.
+
+OPTIONS
+    -h, --help
+        Print this help message.
+
+    -d, --datafields *"fieldlabel,..." | "fieldlabel,value,criteria..." |
+    "fieldlabel,value,value,..."*
+        This value is mode specific. In general, it's a list of comma
+        separated data field labels and associated mode specific values.
+
+        For *datafields* mode, input value format is: *fieldlabel,...*.
+        Examples:
+
+            Extreg
+            Extreg,CompoundName,ID
+
+        For *datafieldsbyvalue* mode, input value format contains these
+        triplets: *fieldlabel,value, criteria...*. Possible values for
+        criteria: *le, ge or eq*. The value of --ValueComparisonMode
+        indicates whether values are compared using numerical or string
+        comparison operators. The default is to treat data field values as
+        numerical values and use numerical comparison operators. Examples:
+
+            MolWt,450,le
+            MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le
+
+        For *datafieldsbyregex* mode, input value format contains these
+        triplets: *fieldlabel,regex, criteria...*. *regex* corresponds to
+        any valid regular expression and is used to match the values for
+        the specified *fieldlabel*. Possible values for criteria: *eq or ne*.
+        For *eq* and *ne* criteria, the data field value is matched against
+        the regular expression using =~ and !~ respectively. The
+        --RegexIgnoreCase option value determines whether to ignore letter
+        case during the regular expression match. Examples:
+
+            Name,ol,eq
+            Name,'^pat',ne
+
+        For *datafieldbylist* and *datafielduniquebylist* mode, input value
+        format is: *fieldlabel,value1,value2...*. This is equivalent to
+        *datafieldsbyvalue* mode with this input value
+        format: *fieldlabel,value1,eq,fieldlabel,value2,eq,...*. For
+        *datafielduniquebylist* mode, only unique compounds identified by
+        first occurrence of *value* associated with *fieldlabel* in
+        *SDFile(s)* are kept; any subsequent compounds are simply ignored.
+
+        For *datafieldnotbylist* mode, input value format is:
+        *fieldlabel,value1,value2...*. In this mode, the script behaves
+        exactly opposite to *datafieldbylist* mode, and only those compounds
+        are extracted whose data field values don't match any specified data
+        field value.
+
+    --datafieldsfile *filename*
+        Filename which contains various mode specific values. This option
+        provides a way to specify mode specific values in a file instead of
+        entering them on the command line using -d --datafields.
+
+        For *datafields* mode, input file lines contain comma delimited
+        field labels: *fieldlabel,...*. Example:
+
+            Line 1:MolId
+            Line 2:"Extreg",CompoundName,ID
+
+        For *datafieldsbyvalue* mode, input file lines contain these comma
+        separated triplets: *fieldlabel,value, criteria*. Possible values
+        for criteria: *le, ge or eq*. Examples:
+
+            Line 1:MolWt,450,le
+
+            Line 1:"MolWt",450,le,"LogP",5,le,"SumNumNO",10,le,"SumNHOH",5,le
+
+            Line 1:MolWt,450,le
+            Line 2:"LogP",5,le
+            Line 3:"SumNumNO",10,le
+            Line 4: SumNHOH,5,le
+
+        For *datafieldbylist* and *datafielduniquebylist* mode, input file
+        line format is:
+
+            Line 1:fieldlabel;
+            Subsequent lines:value1,value2...
+
+        For *datafieldbylist*, *datafielduniquebylist*, and
+        *datafieldnotbylist* mode, input file line format is:
+
+            Line 1:fieldlabel;
+            Subsequent lines:value1,value2...
+
+        For *datafielduniquebylist* mode, only unique compounds identified
+        by first occurrence of *value* associated with *fieldlabel* in
+        *SDFile(s)* are kept; any subsequent compounds are simply ignored.
+        Example:
+
+            Line 1: MolID
+            Subsequent Lines:
+            907508
+            832291,4642
+            "1254","907303"
+
+    --indelim *comma | tab | semicolon*
+        Delimiter used to specify text values for -d --datafields and
+        --datafieldsfile options. Possible values: *comma, tab, or
+        semicolon*. Default value: *comma*.
+
+    -m, --mode *alldatafields | commondatafields | datafields |
+    datafieldsbyvalue | datafieldsbyregex | datafieldbylist |
+    datafielduniquebylist | datafieldnotbylist | molnames | randomcmpds |
+    recordnum | recordnums | recordrange | 2dcmpdrecords | 3dcmpdrecords*
+        Specify what to extract from *SDFile(s)*. Possible values:
+        *alldatafields, commondatafields, datafields, datafieldsbyvalue,
+        datafieldsbyregex, datafieldbylist, datafielduniquebylist,
+        datafieldnotbylist, molnames, randomcmpds, recordnum, recordnums,
+        recordrange, 2dcmpdrecords, 3dcmpdrecords*. Default value:
+        *alldatafields*.
+
+        For *alldatafields* and *molnames* mode, only a CSV/TSV text file is
+        generated; for all other modes, however, an SD file is generated by
+        default - you can change the behavior to generate a text file using
+        the *--output* option.
+
+        For *3DCmpdRecords* mode, only those compounds with at least one
+        non-zero value for Z atomic coordinates are retrieved; however,
+        during retrieval of compounds in *2DCmpdRecords* mode, all Z atomic
+        coordinates must be zero.
+
+    -n, --numofcmpds *number*
+        Number of compounds to extract during *randomcmpds* mode.
+
+    --outdelim *comma | tab | semicolon*
+        Delimiter for output CSV/TSV text file(s). Possible values: *comma,
+        tab, or semicolon*. Default value: *comma*.
+
+    --output *SD | text | both*
+        Type of output files to generate. Possible values: *SD, text, or
+        both*. Default value: *SD*. For *alldatafields* and *molnames* mode,
+        this option is ignored and only a CSV/TSV text file is generated.
+
+    -o, --overwrite
+        Overwrite existing files.
+
+    -q, --quote *yes | no*
+        Put quotes around column values in output CSV/TSV text file(s).
+        Possible values: *yes or no*. Default value: *yes*.
+
+    --record *recnum | recnums | startrecnum,endrecnum*
+        Record number, record numbers or range of records to extract during
+        *recordnum*, *recordnums* and *recordrange* mode. Input value format
+        is: <num>, <num1,num2,...> and <startnum, endnum> for *recordnum*,
+        *recordnums* and *recordrange* modes respectively. Default value:
+        none.
+
+    --RegexIgnoreCase *yes or no*
+        Specify whether to ignore case during *datafieldsbyregex* value of
+        -m, --mode option. Possible values: *yes or no*. Default value:
+        *yes*.
+
+    -r, --root *rootname*
+        New file name is generated using the root: <Root>.<Ext>. Default for
+        new file names: <SDFileName><mode>.<Ext>. The file type determines
+        <Ext> value. The sdf, csv, and tsv <Ext> values are used for SD,
+        comma/semicolon, and tab delimited text files respectively. This
+        option is ignored for multiple input files.
+
+    -s, --seed *number*
+        Random number seed used for *randomcmpds* mode. Default: 123456789.
+
+    --StrDataString *yes | no*
+        Specify whether to write out structure data string to CSV/TSV text
+        file(s). Possible values: *yes or no*. Default value: *no*.
+
+        The value of StrDataStringDelimiter option is used as a delimiter to
+        join structure data lines into a structure data string.
+
+        This option is ignored during generation of SD file(s).
+
+    --StrDataStringDelimiter *text*
+        Delimiter for joining multiple structure data lines into a string
+        before writing to CSV/TSV text file(s). Possible values: *any
+        alphanumeric text*. Default value: *|*.
+
+        This option is ignored during generation of SD file(s).
+
+    --StrDataStringMode *StrOnly | StrAndDataFields*
+        Specify whether to include SD data fields and values along with the
+        structure data into structure data string before writing it out to
+        CSV/TSV text file(s). Possible values: *StrOnly or
+        StrAndDataFields*. Default value: *StrOnly*.
+
+        The value of StrDataStringDelimiter option is used as a delimiter to
+        join structure data lines into a structure data string.
+
+        This option is ignored during generation of SD file(s).
+
+    --ValueComparisonMode *Numeric | Alphanumeric*
+        Specify how to compare data field values during *datafieldsbyvalue*
+        mode: Compare values using either numeric or string (eq, le, ge)
+        comparison operators. Possible values: *Numeric or Alphanumeric*.
+        Default value: *Numeric*.
+
+    -v, --violations *number*
+        Number of criterion violations allowed for values specified during
+        *datafieldsbyvalue* and *datafieldsbyregex* mode. Default value:
+        *0*.
+
+    -w, --workingdir *dirname*
+        Location of working directory. Default: current directory.
+
+EXAMPLES
+    To retrieve all data fields from SD files and generate CSV text files,
+    type:
+
+        % ExtractFromSDFiles.pl -o Sample.sdf
+        % ExtractFromSDFiles.pl -o *.sdf
+
+    To retrieve all data fields from an SD file and generate CSV text files
+    containing a column with structure data as a string with | as line
+    delimiter, type:
+
+        % ExtractFromSDFiles.pl --StrDataString Yes -o Sample.sdf
+
+    To retrieve the Mol_ID data field from an SD file and generate CSV text
+    files containing a column with structure data along with all data fields
+    as a string with | as line delimiter, type:
+
+        % ExtractFromSDFiles.pl -m datafields -d "Mol_ID" --StrDataString Yes
+        --StrDataStringMode StrAndDataFields --StrDataStringDelimiter "|"
+        --output text -o Sample.sdf
+
+    To retrieve common data fields which exist for all the compounds in an
+    SD file and generate a TSV text file NewSample.tsv, type:
+
+        % ExtractFromSDFiles.pl -m commondatafields --outdelim tab -r NewSample
+        --output Text -o Sample.sdf
+
+    To retrieve Mol_ID, MolWeight, and CompoundName data fields from an SD
+    file and generate a CSV text file NewSample.csv, type:
+
+        % ExtractFromSDFiles.pl -m datafields
+        -d "Mol_ID,MolWeight,CompoundName" -r NewSample --output Text -o Sample.sdf
+
+    To retrieve compounds which meet a specific set of criteria -
+    MolWt <= 450, LogP <= 5 and SumNO <= 10 - from an SD file and generate a
+    new SD file NewSample.sdf, type:
+
+        % ExtractFromSDFiles.pl -m datafieldsbyvalue
+        -d "MolWt,450,le,LogP,5,le,SumNO,10,le" -r NewSample -o Sample.sdf
+
+    To retrieve compounds from an SD file with a specific set of values for
+    Mol_ID and generate a new SD file NewSample.sdf, type:
+
+        % ExtractFromSDFiles.pl -m datafieldbylist -d "Mol_ID,159,4509,4619"
+        -r NewSample -o Sample.sdf
+
+    To retrieve compounds from an SD file with values for Mol_ID not on a list
+    of specified values and generate a new SD file NewSample.sdf, type:
+
+        % ExtractFromSDFiles.pl -m datafieldnotbylist -d "Mol_ID,159,4509,4619"
+        -r NewSample -o Sample.sdf
+
+    To retrieve 10 random compounds from an SD file and generate a new SD file
+    RandomSample.sdf, type:
+
+        % ExtractFromSDFiles.pl -m randomcmpds -n 10 -r RandomSample
+        -o Sample.sdf
+
+    To retrieve compound record number 10 from an SD file and generate a new
+    SD file NewSample.sdf, type:
+
+        % ExtractFromSDFiles.pl -m recordnum --record 10 -r NewSample
+        -o Sample.sdf
+
+    To retrieve compound record numbers 10, 20 and 30 from an SD file and
+    generate a new SD file NewSample.sdf, type:
+
+        % ExtractFromSDFiles.pl -m recordnums --record 10,20,30 -r NewSample
+        -o Sample.sdf
+
+    To retrieve compound records 10 through 20 from an SD file and generate a
+    new SD file NewSample.sdf, type:
+
+        % ExtractFromSDFiles.pl -m recordrange --record 10,20 -r NewSample
+        -o Sample.sdf
+
+AUTHOR
+    Manish Sud <msud@san.rr.com>
+
+SEE ALSO
+    FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl,
+    MergeTextFilesWithSD.pl
+
+COPYRIGHT
+    Copyright (C) 2015 Manish Sud. All rights reserved.
+
+    This file is part of MayaChemTools.
+
+    MayaChemTools is free software; you can redistribute it and/or modify it
+    under the terms of the GNU Lesser General Public License as published by
+    the Free Software Foundation; either version 3 of the License, or (at
+    your option) any later version.
+
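The documentation above describes *datafieldsbyvalue* selection as triplets of *fieldlabel,value,criteria*, compared according to --ValueComparisonMode and tolerated up to --violations failures. The following Python sketch only illustrates that described behavior; it is not the Perl implementation in ExtractFromSDFiles.pl. The names `parse_triplets`, `count_violations`, and `keep_record` are hypothetical, records are modeled as plain dictionaries, and counting a missing data field as a violation is an assumption, not something the documentation states.

```python
import operator

# Criteria names taken from the documentation: le, ge, eq.
OPS = {"le": operator.le, "ge": operator.ge, "eq": operator.eq}


def parse_triplets(spec, delim=","):
    """Split 'label,value,criterion,...' into (label, value, criterion) tuples."""
    parts = [p.strip() for p in spec.split(delim)]
    if len(parts) % 3 != 0:
        raise ValueError("expected fieldlabel,value,criteria triplets")
    triplets = list(zip(parts[0::3], parts[1::3], parts[2::3]))
    for _, _, criterion in triplets:
        if criterion not in OPS:
            raise ValueError(f"unknown criterion: {criterion}")
    return triplets


def count_violations(record, triplets, mode="Numeric"):
    """Count how many criteria a record (dict of field label -> text) fails."""
    violations = 0
    for label, value, criterion in triplets:
        field = record.get(label)
        if field is None:
            # Assumption: a missing data field counts as a violation.
            violations += 1
            continue
        if mode == "Numeric":
            ok = OPS[criterion](float(field), float(value))
        else:
            # Alphanumeric mode: plain string comparison.
            ok = OPS[criterion](str(field), str(value))
        if not ok:
            violations += 1
    return violations


def keep_record(record, spec, mode="Numeric", max_violations=0):
    """Return True when the record fails at most max_violations criteria."""
    return count_violations(record, parse_triplets(spec), mode) <= max_violations
```

For instance, `keep_record({"MolWt": "512.7", "LogP": "4.0"}, "MolWt,450,le,LogP,5,le")` rejects the record, while the same call with `max_violations=1` keeps it. The comparison mode matters: the value `9` satisfies `X,10,le` numerically but not as a string comparison, since the string "9" sorts after "10".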