Mercurial > repos > deepakjadmin > mayatool3_test2
view docs/scripts/txt/ExtractFromSDFiles.txt @ 0:4816e4a8ae95 draft default tip
Uploaded
author | deepakjadmin |
---|---|
date | Wed, 20 Jan 2016 09:23:18 -0500 |
parents | |
children |
line wrap: on
line source
NAME ExtractFromSDFiles.pl - Extract specific data from SDFile(s) SYNOPSIS ExtractFromSDFiles.pl SDFile(s)... ExtractFromSDFiles.pl [-h, --help] [-d, --datafields "fieldlabel,..." | "fieldlabel,value,criteria..." | "fieldlabel,value,value..."] [--datafieldsfile filename] [--indelim comma | tab | semicolon] [-m, --mode alldatafields | commondatafields | | datafieldnotbylist | datafields | datafieldsbyvalue | datafieldsbyregex | datafieldbylist | datafielduniquebylist | molnames | randomcmpds | recordnum | recordnums | recordrange | 2dcmpdrecords | 3dcmpdrecords ] [-n, --numofcmpds number] [--outdelim comma | tab | semicolon] [--output SD | text | both] [-o, --overwrite] [-q, --quote yes | no] [--record recnum | startrecnum,endrecnum] --RegexIgnoreCase *yes or no* [-r, --root rootname] [-s, --seed number] [--StrDataString yes | no] [--StrDataStringDelimiter text] [--StrDataStringMode StrOnly | StrAndDataFields] [--ValueComparisonMode *Numeric | Alphanumeric*] [-v, --violations- number] [-w, --workingdir dirname] SDFile(s)... DESCRIPTION Extract specific data from *SDFile(s)* and generate appropriate SD or CSV/TSV text file(s). The structure data from SDFile(s) is not transferred to CSV/TSV text file(s). Multiple SDFile names are separated by spaces. The valid file extensions are *.sdf* and *.sd*. All other file names are ignored. All the SD files in a current directory can be specified either by **.sdf* or the current directory name. OPTIONS -h, --help Print this help message. -d, --datafields *"fieldlabel,..." | "fieldlabel,value,criteria..." | "fieldlabel,value,value,..."* This value is mode specific. In general, it's a list of comma separated data field labels and associated mode specific values. For *datafields* mode, input value format is: *fieldlabel,...*. Examples: Extreg Extreg,CompoundName,ID For *datafieldsbyvalue* mode, input value format contains these triplets: *fieldlabel,value, criteria...*. Possible values for criteria: *le, ge or eq*. The values of --ValueComparisonMode indicates whether values are compared numerical or string comarison operators. Default is to consider data field values as numerical values and use numerical comparison operators. Examples: MolWt,450,le MolWt,450,le,LogP,5,le,SumNumNO,10,le,SumNHOH,5,le For *datafieldsbyregex* mode, input value format contains these triplets: *fieldlabel,regex, criteria...*. *regex* corresponds to any valid regular expression and is used to match the values for specified *fieldlabel*. Possible values for criteria: *eq or ne*. During *eq* and *ne* values, data field label value is matched with regular expression using =~ and !~ respectively. --RegexIgnoreCase option value is used to determine whether to ignore letter upper/lower case during regular expression match. Examples: Name,ol,eq Name,'^pat',ne For *datafieldbylist* and *datafielduniquebylist* mode, input value format is: *fieldlabel,value1,value2...*. This is equivalent to *datafieldsbyvalue* mode with this input value format:*fieldlabel,value1,eq,fieldlabel,value2,eq,...*. For *datafielduniquebylist* mode, only unique compounds identified by first occurrence of *value* associated with *fieldlabel* in *SDFile(s)* are kept; any subsequent compounds are simply ignored. For *datafieldnotbylist* mode, input value format is: *fieldlabel,value1,value2...*. In this mode, the script behaves exactly opposite of *datafieldbylist* mode, and only those compounds are extracted whose data field values don't match any specified data field value. --datafieldsfile *filename* Filename which contains various mode specific values. This option provides a way to specify mode specific values in a file instead of entering them on the command line using -d --datafields. For *datafields* mode, input file lines contain comma delimited field labels: *fieldlabel,...*. Example: Line 1:MolId Line 2:"Extreg",CompoundName,ID For *datafieldsbyvalue* mode, input file lines contains these comma separated triplets: *fieldlabel,value, criteria*. Possible values for criteria: *le, ge or eq*. Examples: Line 1:MolWt,450,le Line 1:"MolWt",450,le,"LogP",5,le,"SumNumNO",10,le,"SumNHOH",5,le Line 1:MolWt,450,le Line 2:"LogP",5,le Line 3:"SumNumNO",10,le Line 4: SumNHOH,5,le For *datafieldbylist* and *datafielduniquebylist* mode, input file line format is: Line 1:fieldlabel; Subsequent lines:value1,value2... For *datafieldbylist*, *datafielduniquebylist*, and *datafieldnotbylist* mode, input file line format is: Line 1:fieldlabel; Subsequent lines:value1,value2... For *datafielduniquebylist* mode, only unique compounds identified by first occurrence of *value* associated with *fieldlabel* in *SDFile(s)* are kept; any subsequent compounds are simply ignored. Example: Line 1: MolID Subsequent Lines: 907508 832291,4642 "1254","907303" --indelim *comma | tab | semicolon* Delimiter used to specify text values for -d --datafields and --datafieldsfile options. Possible values: *comma, tab, or semicolon*. Default value: *comma*. -m, --mode *alldatafields | commondatafields | datafields | datafieldsbyvalue | datafieldsbyregex | datafieldbylist | datafielduniquebylist | datafieldnotbylist | molnames | randomcmpds | recordnum | recordnums | recordrange | 2dcmpdrecords | 3dcmpdrecords* Specify what to extract from *SDFile(s)*. Possible values: *alldatafields, commondatafields, datafields, datafieldsbyvalue, datafieldsbyregex, datafieldbylist, datafielduniquebylist, datafieldnotbylist, molnames, randomcmpds, recordnum, recordnums, recordrange, 2dcmpdrecords, 3dcmpdrecords*. Default value: *alldatafields*. For *alldatafields* and *molnames* mode, only a CSV/TSV text file is generated; for all other modes, however, a SD file is generated by default - you can change the behavior to genereate text file using *--output* option. For *3DCmpdRecords* mode, only those compounds with at least one non-zero value for Z atomic coordinates are retrieved; however, during retrieval of compounds in *2DCmpdRecords* mode, all Z atomic coordinates must be zero. -n, --numofcmpds *number* Number of compouds to extract during *randomcmpds* mode. --outdelim *comma | tab | semicolon* Delimiter for output CSV/TSV text file(s). Possible values: *comma, tab, or semicolon* Default value: *comma* --output *SD | text | both* Type of output files to generate. Possible values: *SD, text, or both*. Default value: *SD*. For *alldatafields* and *molnames* mode, this option is ingored and only a CSV/TSV text file is generated. -o, --overwrite Overwrite existing files. -q, --quote *yes | no* Put quote around column values in output CSV/TSV text file(s). Possible values: *yes or no*. Default value: *yes*. --record *recnum | recnums | startrecnum,endrecnum* Record number, record numbers or range of records to extract during *recordnum*, *recordnums* and *recordrange* mode. Input value format is: <num>, <num1,num2,...> and <startnum, endnum> for *recordnum*, *recordnums* and *recordrange* modes recpectively. Default value: none. --RegexIgnoreCase *yes or no* Specify whether to ingnore case during *datafieldsbyregex* value of -m, --mode option. Possible values: *yes or no*. Default value: *yes*. -r, --root *rootname* New file name is generated using the root: <Root>.<Ext>. Default for new file names: <SDFileName><mode>.<Ext>. The file type determines <Ext> value. The sdf, csv, and tsv <Ext> values are used for SD, comma/semicolon, and tab delimited text files respectively.This option is ignored for multiple input files. -s, --seed *number* Random number seed used for *randomcmpds* mode. Default:123456789. --StrDataString *yes | no* Specify whether to write out structure data string to CSV/TSV text file(s). Possible values: *yes or no*. Default value: *no*. The value of StrDataStringDelimiter option is used as a delimiter to join structure data lines into a structure data string. This option is ignored during generation of SD file(s). --StrDataStringDelimiter *text* Delimiter for joining multiple stucture data lines into a string before writing to CSV/TSV text file(s). Possible values: *any alphanumeric text*. Default value: *|*. This option is ignored during generation of SD file(s). --StrDataStringMode *StrOnly | StrAndDataFields* Specify whether to include SD data fields and values along with the structure data into structure data string before writing it out to CSV/TSV text file(s). Possible values: *StrOnly or StrAndDataFields*. Default value: *StrOnly*. The value of StrDataStringDelimiter option is used as a delimiter to join structure data lines into a structure data string. This option is ignored during generation of SD file(s). --ValueComparisonMode *Numeric | Alphanumeric* Specify how to compare data field values during *datafieldsbyvalue* mode: Compare values using either numeric or string ((eq, le, ge) comparison operators. Possible values: *Numeric or Alphanumeric*. Defaule value: *Numeric*. -v, --violations *number* Number of criterion violations allowed for values specified during *datafieldsbyvalue* and *datafieldsbyregex* mode. Default value: *0*. -w, --workingdir *dirname* Location of working directory. Default: current directory. EXAMPLES To retrieve all data fields from SD files and generate CSV text files, type: % ExtractFromSDFiles.pl -o Sample.sdf % ExtractFromSDFiles.pl -o *.sdf To retrieve all data fields from SD file and generate CSV text files containing a column with structure data as a string with | as line delimiter, type: % ExtractFromSDFiles.pl --StrDataString Yes -o Sample.sdf To retrieve MOL_ID data fileld from SD file and generate CSV text files containing a column with structure data along with all data fields as a string with | as line delimiter, type: % ExtractFromSDFiles.pl -m datafields -d "Mol_ID" --StrDataString Yes --StrDataStringMode StrAndDataFields --StrDataStringDelimiter "|" --output text -o Sample.sdf To retrieve common data fields which exists for all the compounds in a SD file and generate a TSV text file NewSample.tsv, type: % ExtractFromSDFiles.pl -m commondatafields --outdelim tab -r NewSample --output Text -o Sample.sdf To retrieve MolId, ExtReg, and CompoundName data field from a SD file and generate a CSV text file NewSample.csv, type: % ExtractFromSDFiles.pl -m datafields -d "Mol_ID,MolWeight, CompoundName" -r NewSample --output Text -o Sample.sdf To retrieve compounds from a SD which meet a specific set of criteria - MolWt <= 450, LogP <= 5 and SumNO < 10 - from a SD file and generate a new SD file NewSample.sdf, type: % ExtractFromSDFiles.pl -m datafieldsbyvalue -d "MolWt,450,le,LogP ,5,le,SumNO,10" -r NewSample -o Sample.sdf To retrive compounds from a SD file with a specific set of values for MolID and generate a new SD file NewSample.sdf, type: % ExtractFromSDFiles.pl -m datafieldbylist -d "Mol_ID,159,4509,4619" -r NewSample -o Sample.sdf To retrive compounds from a SD file with values for MolID not on a list of specified values and generate a new SD file NewSample.sdf, type: % ExtractFromSDFiles.pl -m datafieldnotbylist -d "Mol_ID,159,4509,4619" -r NewSample -o Sample.sdf To retrive 10 random compounds from a SD file and generate a new SD file RandomSample.sdf, type: % ExtractFromSDFiles.pl -m randomcmpds -n 10 -r RandomSample -o Sample.sdf To retrive compound record number 10 from a SD file and generate a new SD file NewSample.sdf, type: % ExtractFromSDFiles.pl -m recordnum --record 10 -r NewSample -o Sample.sdf To retrive compound record numbers 10, 20 and 30 from a SD file and generate a new SD file NewSample.sdf, type: % ExtractFromSDFiles.pl -m recordnums --record 10,20,30 -r NewSample -o Sample.sdf To retrive compound records between 10 to 20 from SD file and generate a new SD file NewSample.sdf, type: % ExtractFromSDFiles.pl -m recordrange --record 10,20 -r NewSample -o Sample.sdf AUTHOR Manish Sud <msud@san.rr.com> SEE ALSO FilterSDFiles.pl, InfoSDFiles.pl, SplitSDFiles.pl, MergeTextFilesWithSD.pl COPYRIGHT Copyright (C) 2015 Manish Sud. All rights reserved. This file is part of MayaChemTools. MayaChemTools is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 3 of the License, or (at your option) any later version.