comparison docs/scripts/txt/ExtendedConnectivityFingerprints.txt @ 0:4816e4a8ae95 draft default tip

Uploaded
author deepakjadmin
date Wed, 20 Jan 2016 09:23:18 -0500
1 NAME
2 ExtendedConnectivityFingerprints.pl - Generate extended connectivity
3 fingerprints for SD files
4
5 SYNOPSIS
6 ExtendedConnectivityFingerprints.pl SDFile(s)...
7
8 ExtendedConnectivityFingerprints.pl [--AromaticityModel
9 *AromaticityModelType*] [-a, --AtomIdentifierType
10 *AtomicInvariantsAtomTypes*] [--AtomicInvariantsToUse
11 *"AtomicInvariant,AtomicInvariant..."*] [--FunctionalClassesToUse
12 *"FunctionalClass1,FunctionalClass2..."*] [--BitsOrder *Ascending |
13 Descending*] [-b, --BitStringFormat *BinaryString | HexadecimalString*]
14 [--CompoundID *DataFieldName or LabelPrefixString*] [--CompoundIDLabel
15 *text*] [--CompoundIDMode *DataField | MolName | LabelPrefix | MolNameOrLabelPrefix*] [--DataFields
16 *"FieldLabel1,FieldLabel2,..."*] [-d, --DataFieldsMode *All | Common |
17 Specify | CompoundID*] [-f, --Filter *Yes | No*] [--FingerprintsLabel
18 *text*] [-h, --help] [-k, --KeepLargestComponent *Yes | No*] [-m, --mode
19 *ExtendedConnectivity | ExtendedConnectivityCount |
20 ExtendedConnectivityBits*] [-n, --NeighborhoodRadius *number*]
21 [--OutDelim *comma | tab | semicolon*] [--output *SD | FP | text | all*]
22 [-o, --overwrite] [-q, --quote *Yes | No*] [-r, --root *RootName*] [-s,
23 --size *number*] [--UsePerlCoreRandom *Yes | No*] [-v,
24 --VectorStringFormat *IDsAndValuesString | IDsAndValuesPairsString |
25 ValuesAndIDsString | ValuesAndIDsPairsString*] [-w, --WorkingDir
26 dirname] SDFile(s)...
27
28 DESCRIPTION
29 Generate extended connectivity fingerprints [ Ref 48, Ref 52 ] for
30 *SDFile(s)* and create appropriate SD, FP or CSV/TSV text file(s)
31 containing fingerprints vector strings corresponding to molecular
32 fingerprints.
33
34 Multiple SDFile names are separated by spaces. The valid file extensions
35 are *.sdf* and *.sd*. All other file names are ignored. All the SD files
36 in a current directory can be specified either by **.sdf* or the current
37 directory name.
38
39 The current release of MayaChemTools supports generation of extended
40 connectivity fingerprints corresponding to the following -a,
41 --AtomIdentifierType values:
42
43 AtomicInvariantsAtomTypes, DREIDINGAtomTypes, EStateAtomTypes,
44 FunctionalClassAtomTypes, MMFF94AtomTypes, SLogPAtomTypes,
45 SYBYLAtomTypes, TPSAAtomTypes, UFFAtomTypes
46
47 Based on values specified for -a, --AtomIdentifierType,
48 --AtomicInvariantsToUse and --FunctionalClassesToUse, initial atom types
49 are assigned to all non-hydrogen atoms in a molecule and these atom
50 type strings are converted into initial atom identifier integers using
51 the TextUtil::HashCode function. Duplicate atom identifiers are removed.
52
53 For -n, --NeighborhoodRadius value of *0*, the initial set of unique
54 atom identifiers comprises the molecule fingerprints. Otherwise, atom
55 neighborhoods are generated for each non-hydrogen atom up to specified
56 -n, --NeighborhoodRadius value. For each non-hydrogen central atom at a
57 specific radius, its neighbors at next radius level along with their
58 bond orders and previously calculated atom identifiers are collected
59 which in turn are used to generate a new integer atom identifier; the
60 bond orders and atom identifier pairs list is first sorted by bond order
61 followed by atom identifiers to make these values graph invariant.
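
The identifier generation described above can be sketched in Python. This is an illustrative sketch only, not the MayaChemTools implementation: `hash_code` is a CRC-32 stand-in for TextUtil::HashCode (an assumption), the text that gets hashed at each radius is a hypothetical layout, and the names `atom_identifiers`, `atom_types` and `bonds` are invented for the example.

```python
import zlib


def hash_code(text):
    """Stand-in for TextUtil::HashCode (assumption: CRC-32, not the real function)."""
    return zlib.crc32(text.encode("utf-8"))


def atom_identifiers(atom_types, bonds, radius):
    """Return {radius level: [identifier per non-hydrogen atom]}.

    atom_types: initial atom type strings, one per non-hydrogen atom.
    bonds: dict mapping (i, j) atom index pairs to a bond order.
    """
    # Adjacency list: atom index -> [(bond order, neighbor index), ...]
    neighbors = {i: [] for i in range(len(atom_types))}
    for (i, j), order in bonds.items():
        neighbors[i].append((order, j))
        neighbors[j].append((order, i))

    # Radius 0: hash the initial atom type strings.
    levels = {0: [hash_code(t) for t in atom_types]}

    for level in range(1, radius + 1):
        previous = levels[level - 1]
        current = []
        for i in range(len(atom_types)):
            # Sort by bond order, then by neighbor identifier, so the result
            # does not depend on how the atoms happen to be numbered.
            pairs = sorted((order, previous[j]) for order, j in neighbors[i])
            text = f"{level} {previous[i]} " + " ".join(f"{o} {n}" for o, n in pairs)
            current.append(hash_code(text))
        levels[level] = current
    return levels


# Heavy atoms of ethanol (C-C-O) with hypothetical atom type strings.
levels = atom_identifiers(
    ["C.X1.BO1.H3", "C.X2.BO2.H2", "O.X1.BO1.H1"],
    {(0, 1): 1, (1, 2): 1},
    radius=2,
)
```

Because the (bond order, identifier) pairs are sorted before hashing, renumbering the atoms yields the same collection of identifiers at every radius.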
62
63 After integer atom identifiers have been generated for all non-hydrogen
64 atoms at all specified neighborhood radii, the duplicate integer atom
65 identifiers corresponding to same hash code value generated using
66 TextUtil::HashCode are tracked by keeping the atom identifiers at lower
67 radius. Additionally, all structurally duplicate integer atom
68 identifiers at each specified radius are also tracked by identifying
69 equivalent atoms and bonds corresponding to substructures used for
70 generating atom identifiers and keeping the integer atom identifier with
71 the lowest value.
72
73 For *ExtendedConnectivity* value of fingerprints -m, --mode, the
74 duplicate identifiers are removed from the list and the unique atom
75 identifiers constitute the extended connectivity fingerprints of a
76 molecule.
77
78 For *ExtendedConnectivityCount* value of fingerprints -m, --mode, the
79 occurrences of each unique atom identifier are counted and the
80 unique atom identifiers along with their count constitute the extended
81 connectivity fingerprints of a molecule.
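
As a rough illustration of how the *ExtendedConnectivity* and *ExtendedConnectivityCount* modes differ, the sketch below starts from a flat list of integer atom identifiers. The function names are hypothetical, and the numeric sort is an assumption made for reproducibility; the tool's own ordering of identifiers may differ.

```python
from collections import Counter


def extended_connectivity(identifiers):
    """Unique identifiers only (ExtendedConnectivity mode)."""
    return sorted(set(identifiers))


def extended_connectivity_count(identifiers):
    """Unique identifiers with how often each one occurs (Count mode)."""
    counts = Counter(identifiers)
    ids = sorted(counts)
    return ids, [counts[i] for i in ids]


identifiers = [5, 3, 5, 9, 3, 5]
unique_ids = extended_connectivity(identifiers)          # [3, 5, 9]
ids, values = extended_connectivity_count(identifiers)   # [3, 5, 9], [2, 3, 1]
```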
82
83 For *ExtendedConnectivityBits* value of fingerprints -m, --mode, the
84 unique atom identifiers are used as a random number seed to generate a
85 random integer value between 0 and --size which in turn is used to set
86 corresponding bits in the fingerprint bit-vector string.
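
The seed-to-bit mapping can be sketched as follows. This is an assumption-laden sketch: Python's `random.Random` stands in for Perl's CORE::rand or MathUtil::random, so the bit positions it produces will not match those written by the script. The name `extended_connectivity_bits` is hypothetical.

```python
import random


def extended_connectivity_bits(identifiers, size=1024):
    """Set one bit per unique identifier, using the identifier as the seed."""
    bits = [0] * size
    for identifier in set(identifiers):
        rng = random.Random(identifier)   # the identifier seeds the generator
        bits[rng.randrange(size)] = 1     # integer in the range 0 .. size - 1
    return bits


bits = extended_connectivity_bits([73555770, 333564680, 352413391], size=1024)
```

Two different identifiers can land on the same bit, so the number of set bits is at most the number of unique identifiers.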
87
88 Example of *SD* file containing extended connectivity fingerprints
89 string data:
90
91 ... ...
92 ... ...
93 $$$$
94 ... ...
95 ... ...
96 ... ...
97 41 44 0 0 0 0 0 0 0 0999 V2000
98 -3.3652 1.4499 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
99 ... ...
100 2 3 1 0 0 0 0
101 ... ...
102 M END
103 > <CmpdID>
104 Cmpd1
105
106 > <ExtendedConnectivityFingerprints>
107 FingerprintsVector;ExtendedConnectivity:AtomicInvariantsAtomTypes:Radiu
108 s2;60;AlphaNumericalValues;ValuesString;73555770 333564680 352413391 66
109 6191900 1001270906 1371674323 1481469939 1977749791 2006158649 21414087
110 99 49532520 64643108 79385615 96062769 273726379 564565671 855141035 90
111 6706094 988546669 1018231313 1032696425 1197507444 1331250018 133853...
112
113 $$$$
114 ... ...
115 ... ...
116
117 Example of *FP* file containing extended connectivity fingerprints
118 string data:
119
120 #
121 # Package = MayaChemTools 7.4
122 # Release Date = Oct 21, 2010
123 #
124 # TimeStamp = Fri Mar 11 14:43:57 2011
125 #
126 # FingerprintsStringType = FingerprintsVector
127 #
128 # Description = ExtendedConnectivity:AtomicInvariantsAtomTypes:Radius2
129 # VectorStringFormat = ValuesString
130 # VectorValuesType = AlphaNumericalValues
131 #
132 Cmpd1 60;73555770 333564680 352413391 666191900 1001270906 137167432...
133 Cmpd2 41;73555770 333564680 666191900 1142173602 1363635752 14814699...
134 ... ...
135 ... ..
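
A minimal sketch of reading the FP layout shown above, assuming that header lines start with `#` and use `Key = Value`, and that each data line is a compound ID, whitespace, a value count, a semicolon and space-separated values. The function name `parse_fp_lines` is hypothetical, and compound IDs containing whitespace are not handled.

```python
def parse_fp_lines(lines):
    """Split FP file lines into a header dict and a list of records."""
    header, records = {}, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "# Key = Value"; a bare "#" is a spacer.
            key, sep, value = line[1:].partition("=")
            if sep:
                header[key.strip()] = value.strip()
            continue
        # Data lines look like "Cmpd1 60;73555770 333564680 ..."
        compound_id, rest = line.split(None, 1)
        count, _, values = rest.partition(";")
        records.append((compound_id, int(count), values.split()))
    return header, records
```

Placeholder lines such as `... ...` in the listing above are documentation elisions and would need to be skipped before parsing.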
136
137 Example of CSV *Text* file containing extended connectivity fingerprints
138 string data:
139
140 "CompoundID","ExtendedConnectivityFingerprints"
141 "Cmpd1","FingerprintsVector;ExtendedConnectivity:AtomicInvariantsAtomTy
142 pes:Radius2;60;AlphaNumericalValues;ValuesString;73555770 333564680 352
143 413391 666191900 1001270906 1371674323 1481469939 1977749791 2006158649
144 2141408799 49532520 64643108 79385615 96062769 273726379 564565671 8551
145 41035 906706094 988546669 1018231313 1032696425 1197507444 13312500..."
146 ... ...
147 ... ...
148
149 The current release of MayaChemTools generates the following types of
150 extended connectivity fingerprints vector strings:
151
152 FingerprintsVector;ExtendedConnectivity:AtomicInvariantsAtomTypes:Radi
153 us2;60;AlphaNumericalValues;ValuesString;73555770 333564680 352413391
154 666191900 1001270906 1371674323 1481469939 1977749791 2006158649 21414
155 08799 49532520 64643108 79385615 96062769 273726379 564565671 85514103
156 5 906706094 988546669 1018231313 1032696425 1197507444 1331250018 1338
157 532734 1455473691 1607485225 1609687129 1631614296 1670251330 17303...
158
159 FingerprintsVector;ExtendedConnectivityCount:AtomicInvariantsAtomTypes
160 :Radius2;60;NumericalValues;IDsAndValuesString;73555770 333564680 3524
161 13391 666191900 1001270906 1371674323 1481469939 1977749791 2006158649
162 2141408799 49532520 64643108 79385615 96062769 273726379 564565671...;
163 3 2 1 1 14 1 2 10 4 3 1 1 1 1 2 1 2 1 1 1 2 3 1 1 2 1 3 3 8 2 2 2 6 2
164 1 2 1 1 2 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1
165
166 FingerprintsBitVector;ExtendedConnectivityBits:AtomicInvariantsAtomTyp
167 es:Radius2;1024;BinaryString;Ascending;0000000000000000000000000000100
168 0000000001010000000110000011000000000000100000000000000000000000100001
169 1000000110000000000000000000000000010011000000000000000000000000010000
170 0000000000000000000000000010000000000000000001000000000000000000000000
171 0000000000010000100001000000000000101000000000000000100000000000000...
172
173 FingerprintsBitVector;ExtendedConnectivityBits:AtomicInvariantsAtomTyp
174 es:Radius2;1024;HexadecimalString;Ascending;000000010050c0600800000803
175 0300000091000004000000020000100000000124008200020000000040020000000000
176 2080000000820040010020000000008040000000000080001000000000400000000000
177 4040000090000061010000000800200000000000001400000000020080000000000020
178 00008020200000408000
179
180 FingerprintsVector;ExtendedConnectivity:FunctionalClassAtomTypes:Radiu
181 s2;57;AlphaNumericalValues;ValuesString;24769214 508787397 850393286 8
182 62102353 981185303 1231636850 1649386610 1941540674 263599683 32920567
183 1 571109041 639579325 683993318 723853089 810600886 885767127 90326012
184 7 958841485 981022393 1126908698 1152248391 1317567065 1421489994 1455
185 632544 1557272891 1826413669 1983319256 2015750777 2029559552 20404...
186
187 FingerprintsVector;ExtendedConnectivityCount:FunctionalClassAtomTypes:
188 Radius2;57;NumericalValues;IDsAndValuesString;24769214 508787397 85039
189 3286 862102353 981185303 1231636850 1649386610 1941540674 263599683 32
190 9205671 571109041 639579325 683993318 723853089 810600886 885767127...;
191 1 1 1 10 2 22 3 1 3 3 1 1 1 3 2 2 1 2 2 2 3 1 1 1 1 1 14 1 1 1 1 1 1 2
192 1 2 1 1 2 2 1 1 2 1 1 1 2 1 1 2 1 1 1 1 1 1 1
193
194 FingerprintsBitVector;ExtendedConnectivityBits:FunctionalClassAtomType
195 s:Radius2;1024;BinaryString;Ascending;00000000000000000000100000000000
196 0000000001000100000000001000000000000000000000000000000000101000000010
197 0000001000000000010000000000000000000000000000000000000000000000000100
198 0000000000001000000000000001000000000001001000000000000000000000000000
199 0000000000000000100000000000001000000000000000000000000000000000000...
200
201 FingerprintsVector;ExtendedConnectivity:DREIDINGAtomTypes:Radius2;56;A
202 lphaNumericalValues;ValuesString;280305427 357928343 721790579 1151822
203 898 1207111054 1380963747 1568213839 1603445250 4559268 55012922 18094
204 0813 335715751 534801009 684609658 829361048 972945982 999881534 10076
205 55741 1213692591 1222032501 1224517934 1235687794 1244268533 152812070
206 0 1629595024 1856308891 1978806036 2001865095 2096549435 172675415 ...
207
208 FingerprintsVector;ExtendedConnectivity:EStateAtomTypes:Radius2;62;Alp
209 haNumericalValues;ValuesString;25189973 528584866 662581668 671034184
210 926543080 1347067490 1738510057 1759600920 2034425745 2097234755 21450
211 44754 96779665 180364292 341712110 345278822 386540408 387387308 50430
212 1706 617094135 771528807 957666640 997798220 1158349170 1291258082 134
213 1138533 1395329837 1420277211 1479584608 1486476397 1487556246 1566...
214
215 FingerprintsVector;ExtendedConnectivity:MMFF94AtomTypes:Radius2;64;Alp
216 haNumericalValues;ValuesString;224051550 746527773 998750766 103704190
217 2 1239701709 1248384926 1259447756 1521678386 1631549126 1909437580 20
218 37095052 2104274756 2117729376 8770364 31445800 81450228 314289324 344
219 041929 581773587 638555787 692022098 811840536 929651561 936421792 988
220 636432 1048624296 1054288509 1369487579 1454058929 1519352190 17271...
221
222 FingerprintsVector;ExtendedConnectivity:SLogPAtomTypes:Radius2;71;Alph
223 aNumericalValues;ValuesString;78989290 116507218 489454042 888737940 1
224 162561799 1241797255 1251494264 1263717127 1471206899 1538061784 17654
225 07295 1795036542 1809833874 2020454493 2055310842 2117729376 11868981
226 56731842 149505242 184525155 196984339 288181334 481409282 556716568 6
227 41915747 679881756 721736571 794256218 908276640 992898760 10987549...
228
229 FingerprintsVector;ExtendedConnectivity:SYBYLAtomTypes:Radius2;58;Alph
230 aNumericalValues;ValuesString;199957044 313356892 455463968 465982819
231 1225318176 1678585943 1883366064 1963811677 2117729376 113784599 19153
232 8837 196629033 263865277 416380653 477036669 681527491 730724924 90906
233 5537 1021959189 1133014972 1174311016 1359441203 1573452838 1661585138
234 1668649038 1684198062 1812312554 1859266290 1891651106 2072549404 ...
235
236 FingerprintsVector;ExtendedConnectivity:TPSAAtomTypes:Radius2;47;Alpha
237 NumericalValues;ValuesString;20818206 259344053 862102353 1331904542 1
238 700688206 265614156 363161397 681332588 810600886 885767127 950172500
239 951454814 1059668746 1247054493 1382302230 1399502637 1805025917 19189
240 39561 2114677228 2126402271 8130483 17645742 32278373 149975755 160327
241 654 256360355 279492740 291251259 317592700 333763396 972105960 101...
242
243 FingerprintsVector;ExtendedConnectivity:UFFAtomTypes:Radius2;56;AlphaN
244 umericalValues;ValuesString;280305427 357928343 721790579 1151822898 1
245 207111054 1380963747 1568213839 1603445250 4559268 55012922 180940813
246 335715751 534801009 684609658 829361048 972945982 999881534 1007655741
247 1213692591 1222032501 1224517934 1235687794 1244268533 1528120700 162
248 9595024 1856308891 1978806036 2001865095 2096549435 172675415 18344...
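
The semicolon-delimited layout of these strings can be taken apart with a short parser. The sketch below infers the field order from the samples above and expects one unwrapped string (the samples here are wrapped for display); `parse_fingerprints_string` and its dictionary keys are hypothetical names.

```python
def parse_fingerprints_string(text):
    """Split one unwrapped fingerprints string into labeled fields."""
    fields = text.strip().split(";")
    kind, description = fields[0], fields[1]
    mode, atom_types, radius = description.split(":")
    result = {
        "type": kind,
        "mode": mode,
        "atom_types": atom_types,
        "radius": int(radius.replace("Radius", "")),
    }
    if kind == "FingerprintsBitVector":
        # size; bit string format; bits order; bit string
        size, bit_format, bits_order, bits = fields[2:6]
        result.update(size=int(size), format=bit_format,
                      order=bits_order, bits=bits)
    elif kind == "FingerprintsVector":
        # count; values type; vector string format; one or two value lists
        count, values_type, vector_format = fields[2:5]
        result.update(count=int(count), values_type=values_type,
                      format=vector_format,
                      data=[part.split() for part in fields[5:]])
    else:
        raise ValueError(f"unknown fingerprints string type: {kind}")
    return result
```

For an *IDsAndValuesString* vector, `data` holds two lists (identifiers, then counts); for *ValuesString* it holds one.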
249
250 OPTIONS
251 --AromaticityModel *MDLAromaticityModel | TriposAromaticityModel |
252 MMFFAromaticityModel | ChemAxonBasicAromaticityModel |
253 ChemAxonGeneralAromaticityModel | DaylightAromaticityModel |
254 MayaChemToolsAromaticityModel*
255 Specify aromaticity model to use during detection of aromaticity.
256 Possible values in the current release are: *MDLAromaticityModel,
257 TriposAromaticityModel, MMFFAromaticityModel,
258 ChemAxonBasicAromaticityModel, ChemAxonGeneralAromaticityModel,
259 DaylightAromaticityModel or MayaChemToolsAromaticityModel*. Default
260 value: *MayaChemToolsAromaticityModel*.
261
262 The supported aromaticity model names along with model specific
263 control parameters are defined in AromaticityModelsData.csv, which
264 is distributed with the current release and is available under
265 lib/data directory. Molecule.pm module retrieves data from this file
266 during class instantiation and makes it available to method
267 DetectAromaticity for detecting aromaticity corresponding to a
268 specific model.
269
270 -a, --AtomIdentifierType *AtomicInvariantsAtomTypes |
271 FunctionalClassAtomTypes | DREIDINGAtomTypes | EStateAtomTypes |
272 MMFF94AtomTypes | SLogPAtomTypes | SYBYLAtomTypes | TPSAAtomTypes |
273 UFFAtomTypes*
274 Specify atom identifier type to use for assignment of initial atom
275 identifier to non-hydrogen atoms during calculation of extended
276 connectivity fingerprints [ Ref 48, Ref 52]. Possible values in the
277 current release are: *AtomicInvariantsAtomTypes,
278 FunctionalClassAtomTypes, DREIDINGAtomTypes, EStateAtomTypes,
279 MMFF94AtomTypes, SLogPAtomTypes, SYBYLAtomTypes, TPSAAtomTypes,
280 UFFAtomTypes*. Default value: *AtomicInvariantsAtomTypes*.
281
282 --AtomicInvariantsToUse *"AtomicInvariant,AtomicInvariant..."*
283 This value is used during *AtomicInvariantsAtomTypes* value of a,
284 --AtomIdentifierType option. It's a list of comma separated valid
285 atomic invariant atom types.
286
287 Possible values for atomic invariants are: *AS, X, BO, LBO, SB, DB,
288 TB, H, Ar, RA, FC, MN, SM*. Default value [ Ref 24 ]:
289 *AS,X,BO,H,FC,MN*.
290
291 The atomic invariants abbreviations correspond to:
292
293 AS = Atom symbol corresponding to element symbol
294
295 X<n> = Number of non-hydrogen atom neighbors or heavy atoms
296 BO<n> = Sum of bond orders to non-hydrogen atom neighbors or heavy atoms
297 LBO<n> = Largest bond order of non-hydrogen atom neighbors or heavy atoms
298 SB<n> = Number of single bonds to non-hydrogen atom neighbors or heavy atoms
299 DB<n> = Number of double bonds to non-hydrogen atom neighbors or heavy atoms
300 TB<n> = Number of triple bonds to non-hydrogen atom neighbors or heavy atoms
301 H<n> = Number of implicit and explicit hydrogens for atom
302 Ar = Aromatic annotation indicating whether atom is aromatic
303 RA = Ring atom annotation indicating whether atom is in a ring
304 FC<+n/-n> = Formal charge assigned to atom
305 MN<n> = Mass number indicating isotope other than most abundant isotope
306 SM<n> = Spin multiplicity of atom. Possible values: 1 (singlet), 2 (doublet) or
307 3 (triplet)
308
309 Atom type generated by AtomTypes::AtomicInvariantsAtomTypes class
310 corresponds to:
311
312 AS.X<n>.BO<n>.LBO<n>.SB<n>.DB<n>.TB<n>.H<n>.Ar.RA.FC<+n/-n>.MN<n>.SM<n>
313
314 Except for AS which is a required atomic invariant in atom types,
315 all other atomic invariants are optional. Atom type specification
316 doesn't include atomic invariants with zero or undefined values.
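
To illustrate how an atom type string follows from this specification, here is a small sketch that joins the requested invariants in the documented order and skips zero or undefined values. It is not the AtomTypes::AtomicInvariantsAtomTypes code; the function name, the `invariants` dictionary and the exact rendering of each value are assumptions.

```python
# Order of invariants after the atom symbol (AS), as listed in the specification.
ORDER = ["X", "BO", "LBO", "SB", "DB", "TB", "H", "Ar", "RA", "FC", "MN", "SM"]


def atomic_invariants_atom_type(symbol, invariants,
                                use=("X", "BO", "H", "FC", "MN")):
    """Build an atom type string such as C.X3.BO4 from invariant values."""
    parts = [symbol]                      # AS is always required
    for name in ORDER:
        if name not in use:
            continue
        value = invariants.get(name)
        if not value:                     # zero, False or missing: omitted
            continue
        if name in ("Ar", "RA"):
            parts.append(name)            # annotations carry no number
        elif name == "FC":
            parts.append(f"FC{value:+d}") # signed formal charge: FC+1, FC-1
        else:
            parts.append(f"{name}{value}")
    return ".".join(parts)


atomic_invariants_atom_type("C", {"X": 3, "BO": 4, "H": 0})    # "C.X3.BO4"
atomic_invariants_atom_type("N", {"X": 4, "BO": 4, "FC": 1})   # "N.X4.BO4.FC+1"
```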
317
318 In addition to usage of abbreviations for specifying atomic
319 invariants, the following descriptive words are also allowed:
320
321 X : NumOfNonHydrogenAtomNeighbors or NumOfHeavyAtomNeighbors
322 BO : SumOfBondOrdersToNonHydrogenAtoms or SumOfBondOrdersToHeavyAtoms
323 LBO : LargestBondOrderToNonHydrogenAtoms or LargestBondOrderToHeavyAtoms
324 SB : NumOfSingleBondsToNonHydrogenAtoms or NumOfSingleBondsToHeavyAtoms
325 DB : NumOfDoubleBondsToNonHydrogenAtoms or NumOfDoubleBondsToHeavyAtoms
326 TB : NumOfTripleBondsToNonHydrogenAtoms or NumOfTripleBondsToHeavyAtoms
327 H : NumOfImplicitAndExplicitHydrogens
328 Ar : Aromatic
329 RA : RingAtom
330 FC : FormalCharge
331 MN : MassNumber
332 SM : SpinMultiplicity
333
334 *AtomTypes::AtomicInvariantsAtomTypes* module is used to assign
335 atomic invariant atom types.
336
337 --BitsOrder *Ascending | Descending*
338 Bits order to use during generation of fingerprints bit-vector
339 string for *ExtendedConnectivityBits* value of -m, --mode option.
340 Possible values: *Ascending, Descending*. Default: *Ascending*.
341
342 *Ascending* bit order corresponds to the first bit in each byte being
343 the lowest bit, as opposed to the highest bit.
344
345 Internally, bits are stored in *Ascending* order using Perl vec
346 function. Regardless of machine order, big-endian or little-endian,
347 vec function always considers first string byte as the lowest byte
348 and first bit within each byte as the lowest bit.
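
The two bit orders can be illustrated with a short Python sketch that stores bits the way a 1-bit Perl vec would (lowest bit of each byte first) and then renders each byte either lowest bit first or highest bit first. The function name `bit_string` is hypothetical and the rendering of *Descending* here is an assumption based on the description above.

```python
def bit_string(set_bits, size, order="Ascending"):
    """Render the given bit positions as a string of 1s and 0s."""
    data = bytearray(size // 8)
    for i in set_bits:
        # Bit i lives in byte i // 8 at position i % 8, lowest bit first.
        data[i // 8] |= 1 << (i % 8)
    if order == "Ascending":
        # Lowest bit of each byte is written first.
        return "".join(format(byte, "08b")[::-1] for byte in data)
    # Descending: highest bit of each byte is written first.
    return "".join(format(byte, "08b") for byte in data)


bit_string({0, 9}, 16, "Ascending")    # "1000000001000000"
bit_string({0, 9}, 16, "Descending")   # "0000000100000010"
```

In the *Ascending* rendering, the character at index i of the string is bit i of the fingerprint.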
349
350 -b, --BitStringFormat *BinaryString | HexadecimalString*
351 Format of fingerprints bit-vector string data in output SD, FP or
352 CSV/TSV text file(s) specified by --output used during
353 *ExtendedConnectivityBits* value of -m, --mode option. Possible
354 values: *BinaryString, HexadecimalString*. Default value:
355 *BinaryString*.
356
357 *BinaryString* corresponds to an ASCII string containing 1s and 0s.
358 *HexadecimalString* contains bit values in ASCII hexadecimal format.
359
360 Examples:
361
362 FingerprintsBitVector;ExtendedConnectivityBits:AtomicInvariantsAtomTyp
363 es:Radius2;1024;BinaryString;Ascending;0000000000000000000000000000100
364 0000000001010000000110000011000000000000100000000000000000000000100001
365 1000000110000000000000000000000000010011000000000000000000000000010000
366 0000000000000000000000000010000000000000000001000000000000000000000000
367 0000000000010000100001000000000000101000000000000000100000000000000...
368
369 FingerprintsBitVector;ExtendedConnectivityBits:FunctionalClassAtomType
370 s:Radius2;1024;BinaryString;Ascending;00000000000000000000100000000000
371 0000000001000100000000001000000000000000000000000000000000101000000010
372 0000001000000000010000000000000000000000000000000000000000000000000100
373 0000000000001000000000000001000000000001001000000000000000000000000000
374 0000000000000000100000000000001000000000000000000000000000000000000...
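
The relationship between the two formats can be sketched as a conversion from an *Ascending* binary string to a hexadecimal string. The low-bit-first mapping of each group of four bits is inferred by comparing the leading portions of the AtomicInvariantsAtomTypes binary and hexadecimal sample strings in this document, so treat it as an assumption rather than a specification; `ascending_binary_to_hex` is a hypothetical name.

```python
def ascending_binary_to_hex(bits):
    """Convert an Ascending binary string to hex, four bits per digit."""
    digits = []
    for start in range(0, len(bits), 4):
        nybble = bits[start:start + 4]
        # The first character of each group is its lowest bit, so reverse
        # the group before reading it as a binary number.
        digits.append(format(int(nybble[::-1], 2), "x"))
    return "".join(digits)


# First 32 bits of the AtomicInvariantsAtomTypes sample: only bit 28 is set.
ascending_binary_to_hex("0" * 28 + "1000")   # "00000001"
```

The bit-vector sizes accepted by --size are multiples of four, so every group is complete.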
375
376 --FunctionalClassesToUse *"FunctionalClass1,FunctionalClass2..."*
377 This value is used during *FunctionalClassAtomTypes* value of -a,
378 --AtomIdentifierType option. It's a list of comma separated valid
379 functional classes.
380
381 Possible values for atom functional classes are: *Ar, CA, H, HBA,
382 HBD, Hal, NI, PI, RA*. Default value [ Ref 24 ]:
383 *HBD,HBA,PI,NI,Ar,Hal*.
384
385 The functional class abbreviations correspond to:
386
387 HBD: HydrogenBondDonor
388 HBA: HydrogenBondAcceptor
389 PI : PositivelyIonizable
390 NI : NegativelyIonizable
391 Ar : Aromatic
392 Hal : Halogen
393 H : Hydrophobic
394 RA : RingAtom
395 CA : ChainAtom
396
397 Functional class atom type specification for an atom corresponds to:
398
399 Ar.CA.H.HBA.HBD.Hal.NI.PI.RA
400
401 *AtomTypes::FunctionalClassAtomTypes* module is used to assign
402 functional class atom types. It uses the following definitions [ Ref
403 60-61, Ref 65-66 ]:
404
405 HydrogenBondDonor: NH, NH2, OH
406 HydrogenBondAcceptor: N[!H], O
407 PositivelyIonizable: +, NH2
408 NegativelyIonizable: -, C(=O)OH, S(=O)OH, P(=O)OH
409
410 --CompoundID *DataFieldName or LabelPrefixString*
411 This value is --CompoundIDMode specific and indicates how compound
412 ID is generated.
413
414 For *DataField* value of --CompoundIDMode option, it corresponds to
415 datafield label name whose value is used as compound ID; otherwise,
416 it's a prefix string used for generating compound IDs like
417 LabelPrefixString<Number>. Default value, *Cmpd*, generates compound
418 IDs which look like Cmpd<Number>.
419
420 Examples for *DataField* value of --CompoundIDMode:
421
422 MolID
423 ExtReg
424
425 Examples for *LabelPrefix* or *MolNameOrLabelPrefix* value of
426 --CompoundIDMode:
427
428 Compound
429
430 The value specified above generates compound IDs which correspond to
431 Compound<Number> instead of default value of Cmpd<Number>.
432
433 --CompoundIDLabel *text*
434 Specify compound ID column label for FP or CSV/TSV text file(s) used
435 during *CompoundID* value of --DataFieldsMode option. Default:
436 *CompoundID*.
437
438 --CompoundIDMode *DataField | MolName | LabelPrefix |
439 MolNameOrLabelPrefix*
440 Specify how to generate compound IDs and write to FP or CSV/TSV text
441 file(s) along with generated fingerprints for *FP | text | all*
442 values of --output option: use a *SDFile(s)* datafield value; use
443 molname line from *SDFile(s)*; generate a sequential ID with
444 specific prefix; use combination of both MolName and LabelPrefix
445 with usage of LabelPrefix values for empty molname lines.
446
447 Possible values: *DataField | MolName | LabelPrefix |
448 MolNameOrLabelPrefix*. Default: *LabelPrefix*.
449
450 For *MolNameOrLabelPrefix* value of --CompoundIDMode, molname line
451 in *SDFile(s)* takes precedence over sequential compound IDs
452 generated using *LabelPrefix* and only empty molname values are
453 replaced with sequential compound IDs.
454
455 This is only used for *CompoundID* value of --DataFieldsMode option.
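
The four --CompoundIDMode values can be summarized in a small sketch. The function and parameter names are hypothetical; `compound_id_value` plays the role of the --CompoundID value, and returning an empty string for a missing data field is an assumption, not documented behavior.

```python
def compound_id(mode, number, mol_name="", data_fields=None,
                compound_id_value="Cmpd"):
    """Pick a compound ID for the molecule at 1-based position `number`."""
    if mode == "DataField":
        # --CompoundID names the SD data field holding the ID.
        return (data_fields or {}).get(compound_id_value, "")
    if mode == "MolName":
        return mol_name
    if mode == "LabelPrefix":
        # --CompoundID is a prefix for sequential IDs: Cmpd1, Cmpd2, ...
        return f"{compound_id_value}{number}"
    if mode == "MolNameOrLabelPrefix":
        # The molname line wins; only empty names get a sequential ID.
        return mol_name if mol_name.strip() else f"{compound_id_value}{number}"
    raise ValueError(f"unknown CompoundIDMode: {mode}")


compound_id("LabelPrefix", 1)                                   # "Cmpd1"
compound_id("LabelPrefix", 7, compound_id_value="Compound")     # "Compound7"
```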
456
457 --DataFields *"FieldLabel1,FieldLabel2,..."*
458 Comma delimited list of *SDFile(s)* data fields to extract and
459 write to CSV/TSV text file(s) along with generated fingerprints for
460 *text | all* values of --output option.
461
462 This is only used for *Specify* value of --DataFieldsMode option.
463
464 Examples:
465
466 Extreg
467 MolID,CompoundName
468
469 -d, --DataFieldsMode *All | Common | Specify | CompoundID*
470 Specify how data fields in *SDFile(s)* are transferred to output
471 CSV/TSV text file(s) along with generated fingerprints for *text |
472 all* values of --output option: transfer all SD data fields; transfer
473 SD data fields common to all compounds; extract specified data
474 fields; generate a compound ID using molname line, a compound
475 prefix, or a combination of both. Possible values: *All | Common |
476 Specify | CompoundID*. Default value: *CompoundID*.
477
478 -f, --Filter *Yes | No*
479 Specify whether to check and filter compound data in SDFile(s).
480 Possible values: *Yes or No*. Default value: *Yes*.
481
482 By default, compound data is checked before calculating fingerprints
483 and compounds containing atom data corresponding to non-element
484 symbols or no atom data are ignored.
485
486 --FingerprintsLabel *text*
487 SD data label or text file column label to use for fingerprints
488 string in output SD or CSV/TSV text file(s) specified by --output.
489 Default value: *ExtendedConnectivityFingerprints*.
490
491 -h, --help
492 Print this help message.
493
494 -k, --KeepLargestComponent *Yes | No*
495 Generate fingerprints for only the largest component in molecule.
496 Possible values: *Yes or No*. Default value: *Yes*.
497
498 For molecules containing multiple connected components, fingerprints
499 can be generated in two different ways: use all connected components
500 or just the largest connected component. By default, all atoms
501 except for the largest connected component are deleted before
502 generation of fingerprints.
503
504 -m, --mode *ExtendedConnectivity | ExtendedConnectivityCount |
505 ExtendedConnectivityBits*
506 Specify type of extended connectivity fingerprints to generate for
507 molecules in *SDFile(s)*. Possible values: *ExtendedConnectivity,
508 ExtendedConnectivityCount or ExtendedConnectivityBits*. Default
509 value: *ExtendedConnectivity*.
510
511 For *ExtendedConnectivity* value of fingerprints -m, --mode, a
512 fingerprint vector containing unique atom identifiers constitutes the
513 extended connectivity fingerprints of a molecule.
514
515 For *ExtendedConnectivityCount* value of fingerprints -m, --mode, a
516 fingerprint vector containing unique atom identifiers along with
517 their count constitutes the extended connectivity fingerprints of a
518 molecule.
519
520 For *ExtendedConnectivityBits* value of fingerprints -m, --mode, a
521 fingerprint bit vector indicating presence/absence of structurally
522 unique atom identifiers constitutes the extended connectivity
523 fingerprints of a molecule.
524
525 -n, --NeighborhoodRadius *number*
526 Atomic neighborhood radius for generating extended connectivity
527 neighborhoods. Default value: *2*. Valid values: >= 0. Neighborhood
528 radius of zero corresponds to just the list of non-hydrogen atoms.
529
530 Default value of *2* for atomic neighborhood radius generates
531 extended connectivity fingerprints corresponding to path length or
532 diameter value of *4* [ Ref 52b ].
533
534 --OutDelim *comma | tab | semicolon*
535 Delimiter for output CSV/TSV text file(s). Possible values: *comma,
536 tab, or semicolon*. Default value: *comma*.
537
538 --output *SD | FP | text | all*
539 Type of output files to generate. Possible values: *SD, FP, text, or
540 all*. Default value: *text*.
541
542 -o, --overwrite
543 Overwrite existing files.
544
545 -q, --quote *Yes | No*
546 Put quotes around column values in output CSV/TSV text file(s).
547 Possible values: *Yes or No*. Default value: *Yes*.
548
549 -r, --root *RootName*
550 New file name is generated using the root: <Root>.<Ext>. Default for
551 new file names: <SDFileName><ExtendedConnectivityFP>.<Ext>. The file
552 type determines <Ext> value. The sdf, fpf, csv, and tsv <Ext> values
553 are used for SD, FP, comma/semicolon, and tab delimited text files,
554 respectively. This option is ignored for multiple input files.
555
556 -s, --size *number*
557 Size of bit-vector to use during generation of fingerprints
558 bit-vector string for *ExtendedConnectivityBits* value of -m,
559 --mode. Default value: *1024*. Valid values correspond to any
560 positive integer which satisfies the following criteria: power of 2,
561 >= 32 and <= 2 ** 32.
562
563 Examples:
564
565 512
566 1024
567 2048
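
The stated criteria for --size (a power of 2, at least 32 and at most 2 ** 32) can be checked with a one-line test. This is a sketch of the rule as documented, not the script's own validation code; `is_valid_size` is a hypothetical name.

```python
def is_valid_size(size):
    """True if size is a power of 2 with 32 <= size <= 2 ** 32."""
    # A positive integer is a power of 2 exactly when it has a single set
    # bit, which makes size & (size - 1) equal to zero.
    return (isinstance(size, int)
            and 32 <= size <= 2 ** 32
            and size & (size - 1) == 0)


[is_valid_size(n) for n in (512, 1024, 2048)]   # [True, True, True]
is_valid_size(1000)                              # False: not a power of 2
```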
568
569 --UsePerlCoreRandom *Yes | No*
570 Specify whether to use Perl CORE::rand or MayaChemTools
571 MathUtil::random function during random number generation for
572 setting bits in fingerprints bit-vector strings. Possible values:
573 *Yes or No*. Default value: *Yes*.
574
575 *No* value for --UsePerlCoreRandom allows the generation of
576 fingerprints bit-vector strings which are same across different
577 platforms.
578
579 The random number generator implemented in MayaChemTools is a
580 variant of linear congruential generator (LCG) as described by
581 Miller et al. [ Ref 120 ]. It is also referred to as Lehmer random
582 number generator or Park-Miller random number generator.
583
584 Unlike Perl's core random number generator function rand, the random
585 number generator implemented in MayaChemTools, MathUtil::random,
586 generates consistent random values across different platforms for a
587 specific random seed and leads to generation of portable
588 fingerprints bit-vector strings.
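
For orientation, a Lehmer (Park-Miller) generator can be sketched in a few lines. The multiplier 16807 and modulus 2 ** 31 - 1 below are the classic "minimal standard" parameters and are an assumption here: MathUtil::random is described only as a variant, so its constants and output may differ. The class name `LehmerRandom` is hypothetical.

```python
MODULUS = 2 ** 31 - 1     # a Mersenne prime
MULTIPLIER = 16807        # classic "minimal standard" multiplier (assumption)


class LehmerRandom:
    """Multiplicative linear congruential generator: x = a * x mod m."""

    def __init__(self, seed):
        # The state must stay in 1 .. MODULUS - 1; zero would get stuck.
        self.state = seed % MODULUS or 1

    def random(self):
        """Advance the state and return a float in the open interval (0, 1)."""
        self.state = (MULTIPLIER * self.state) % MODULUS
        return self.state / MODULUS


rng = LehmerRandom(73555770)                  # an atom identifier as the seed
bit_position = int(rng.random() * 1024)       # integer in 0 .. 1023
```

Since the update uses only integer arithmetic, the same seed gives the same sequence on every platform, which is the portability property described above.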
589
590 -v, --VectorStringFormat *ValuesString | IDsAndValuesString |
591 IDsAndValuesPairsString | ValuesAndIDsString | ValuesAndIDsPairsString*
592 Format of fingerprints vector string data in output SD, FP or
593 CSV/TSV text file(s) specified by --output used during
594 *ExtendedConnectivityCount* value of -m, --mode option. Possible
595 values: *ValuesString, IDsAndValuesString, IDsAndValuesPairsString,
596 ValuesAndIDsString, ValuesAndIDsPairsString*.
597
598 Default value during *ExtendedConnectivityCount* value of -m, --mode
599 option: *IDsAndValuesString*.
600
601 Default value during *ExtendedConnectivity* value of -m, --mode
602 option: *ValuesString*.
603
604 Examples:
605
606 FingerprintsVector;ExtendedConnectivity:AtomicInvariantsAtomTypes:Radi
607 us2;60;AlphaNumericalValues;ValuesString;73555770 333564680 352413391
608 666191900 1001270906 1371674323 1481469939 1977749791 2006158649 21414
609 08799 49532520 64643108 79385615 96062769 273726379 564565671 85514103
610 5 906706094 988546669 1018231313 1032696425 1197507444 1331250018 1338
611 532734 1455473691 1607485225 1609687129 1631614296 1670251330 17303...
612
613 FingerprintsVector;ExtendedConnectivityCount:AtomicInvariantsAtomTypes
614 :Radius2;60;NumericalValues;IDsAndValuesString;73555770 333564680 3524
615 13391 666191900 1001270906 1371674323 1481469939 1977749791 2006158649
616 2141408799 49532520 64643108 79385615 96062769 273726379 564565671...;
617 3 2 1 1 14 1 2 10 4 3 1 1 1 1 2 1 2 1 1 1 2 3 1 1 2 1 3 3 8 2 2 2 6 2
618 1 2 1 1 2 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1
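
The five format names suggest the layouts sketched below. Only *ValuesString* and *IDsAndValuesString* are shown in the examples above; the layouts of the other three are inferred from their names and are assumptions. `format_vector` is a hypothetical helper.

```python
def format_vector(ids, values, vector_format="IDsAndValuesString"):
    """Join identifiers and values according to the named layout."""
    ids = [str(i) for i in ids]
    values = [str(v) for v in values]
    if vector_format == "ValuesString":
        return " ".join(values)
    if vector_format == "IDsAndValuesString":
        return " ".join(ids) + ";" + " ".join(values)
    if vector_format == "IDsAndValuesPairsString":
        return " ".join(f"{i} {v}" for i, v in zip(ids, values))
    if vector_format == "ValuesAndIDsString":
        return " ".join(values) + ";" + " ".join(ids)
    if vector_format == "ValuesAndIDsPairsString":
        return " ".join(f"{v} {i}" for i, v in zip(ids, values))
    raise ValueError(f"unknown VectorStringFormat: {vector_format}")


format_vector([73555770, 333564680], [3, 2])
# "73555770 333564680;3 2"
```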
619
620 -w, --WorkingDir *DirName*
621 Location of working directory. Default: current directory.
622
623 EXAMPLES
624 To generate extended connectivity fingerprints corresponding to
625 neighborhood radius up to 2 using atomic invariants atom types in vector
626 string format and create a SampleECAIFP.csv file containing sequential
627 compound IDs along with fingerprints vector strings data, type:
628
629 % ExtendedConnectivityFingerprints.pl -r SampleECAIFP -o Sample.sdf
630
631 To generate extended connectivity count fingerprints corresponding to
632 neighborhood radius up to 2 using atomic invariants atom types in vector
633 string format and create a SampleECAIFP.csv file containing sequential
634 compound IDs along with fingerprints vector strings data, type:
635
636 % ExtendedConnectivityFingerprints.pl -m ExtendedConnectivityCount
637 -r SampleECAIFP -o Sample.sdf
638
639 To generate extended connectivity bits fingerprints as hexadecimal
640 bit-string corresponding to neighborhood radius up to 2 using atomic
641 invariants atom types in vector string format and create a
642 SampleECAIFP.csv file containing sequential compound IDs along with
643 fingerprints vector strings data, type:
644
645 % ExtendedConnectivityFingerprints.pl -m ExtendedConnectivityBits
646 --BitStringFormat HexadecimalString -r SampleECAIFP -o Sample.sdf
647
648 To generate extended connectivity bits fingerprints as binary bit-string
649 corresponding to neighborhood radius up to 2 using atomic invariants
650 atom types in vector string format and create a SampleECAIFP.csv file
651 containing sequential compound IDs along with fingerprints vector
652 strings data, type:
653
654 % ExtendedConnectivityFingerprints.pl -m ExtendedConnectivityBits
655 --BitStringFormat BinaryString -r SampleECAIFP -o Sample.sdf
656
657 To generate extended connectivity fingerprints corresponding to
658 neighborhood radius up to 2 using atomic invariants atom types in vector
659 string format and create SampleECAIFP.sdf, SampleECAIFP.fpf and
660 SampleECAIFP.csv files containing sequential compound IDs in CSV file
661 along with fingerprints vector strings data, type:
662
663 % ExtendedConnectivityFingerprints.pl --output all -r SampleECAIFP
664 -o Sample.sdf
665
666 To generate extended connectivity count fingerprints corresponding to
667 neighborhood radius up to 2 using atomic invariants atom types in vector
668 string format and create SampleECAIFP.sdf, SampleECAIFP.fpf and
669 SampleECAIFP.csv files containing sequential compound IDs in CSV file
670 along with fingerprints vector strings data, type:
671
672 % ExtendedConnectivityFingerprints.pl -m ExtendedConnectivityCount
673 --output all -r SampleECAIFP -o Sample.sdf
674
675 To generate extended connectivity fingerprints corresponding to
676 neighborhood radius up to 2 using functional class atom types in vector
677 string format and create a SampleECFCFP.csv file containing sequential
678 compound IDs along with fingerprints vector strings data, type:
679
680 % ExtendedConnectivityFingerprints.pl -a FunctionalClassAtomTypes
681 -r SampleECFCFP -o Sample.sdf
682
683 To generate extended connectivity fingerprints corresponding to
684 neighborhood radius up to 2 using DREIDING atom types in vector string
685 format and create a SampleECFP.csv file containing sequential compound
686 IDs along with fingerprints vector strings data, type:
687
688 % ExtendedConnectivityFingerprints.pl -a DREIDINGAtomTypes
689 -r SampleECFP -o Sample.sdf
690
691 To generate extended connectivity fingerprints corresponding to
692 neighborhood radius up to 2 using E-state atom types in vector string
693 format and create a SampleECFP.csv file containing sequential compound
694 IDs along with fingerprints vector strings data, type:
695
696 % ExtendedConnectivityFingerprints.pl -a EStateAtomTypes
697 -r SampleECFP -o Sample.sdf
698
699 To generate extended connectivity fingerprints corresponding to
700 neighborhood radius up to 2 using MMFF94 atom types in vector string
701 format and create a SampleECFP.csv file containing sequential compound
702 IDs along with fingerprints vector strings data, type:
703
704 % ExtendedConnectivityFingerprints.pl -a MMFF94AtomTypes
705 -r SampleECFP -o Sample.sdf
706
707 To generate extended connectivity fingerprints corresponding to
708 neighborhood radius up to 2 using SLogP atom types in vector string
709 format and create a SampleECFP.csv file containing sequential compound
710 IDs along with fingerprints vector strings data, type:
711
712 % ExtendedConnectivityFingerprints.pl -a SLogPAtomTypes
713 -r SampleECFP -o Sample.sdf
714
715 To generate extended connectivity fingerprints corresponding to
716 neighborhood radius up to 2 using SYBYL atom types in vector string
717 format and create a SampleECFP.csv file containing sequential compound
718 IDs along with fingerprints vector strings data, type:
719
720 % ExtendedConnectivityFingerprints.pl -a SYBYLAtomTypes
721 -r SampleECFP -o Sample.sdf
722
723 To generate extended connectivity fingerprints corresponding to
724 neighborhood radius up to 2 using TPSA atom types in vector string
725 format and create a SampleECFP.csv file containing sequential compound
726 IDs along with fingerprints vector strings data, type:
727
728 % ExtendedConnectivityFingerprints.pl -a TPSAAtomTypes
729 -r SampleECFP -o Sample.sdf
730
731 To generate extended connectivity fingerprints corresponding to
732 neighborhood radius up to 2 using UFF atom types in vector string format
733 and create a SampleECFP.csv file containing sequential compound IDs
734 along with fingerprints vector strings data, type:
735
736 % ExtendedConnectivityFingerprints.pl -a UFFAtomTypes
737 -r SampleECFP -o Sample.sdf
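
The atom type examples above all pass the same root name (-r SampleECFP) together with -o, so running them one after another would presumably overwrite the same SampleECFP.csv each time. The following is a minimal shell sketch, not part of MayaChemTools, that gives each atom type its own root name. The ecfp_commands function and the SampleEC_<AtomType> root names are illustrative assumptions; only the -a, -r and -o options and the atom type names come from this document.

```shell
#!/bin/sh
# Print one command line per atom type, each with a distinct output root.
# ecfp_commands and the SampleEC_<type> root names are hypothetical.
ecfp_commands() {
    sdfile=$1
    for atomtype in DREIDINGAtomTypes EStateAtomTypes MMFF94AtomTypes \
                    SLogPAtomTypes SYBYLAtomTypes TPSAAtomTypes UFFAtomTypes
    do
        echo "ExtendedConnectivityFingerprints.pl -a $atomtype -r SampleEC_$atomtype -o $sdfile"
    done
}

# Dry run: list the commands without executing them.
ecfp_commands Sample.sdf
```

Once the listed commands look right, the same output could be piped to sh to run them, assuming ExtendedConnectivityFingerprints.pl is on the PATH.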
738
739 To generate extended connectivity fingerprints corresponding to
740 neighborhood radius up to 3 using atomic invariants atom types in vector
741 string format and create a SampleECAIFP.csv file containing sequential
742 compound IDs along with fingerprints vector strings data, type:
743
744 % ExtendedConnectivityFingerprints.pl -a AtomicInvariantsAtomTypes -n 3
745 -r SampleECAIFP -o Sample.sdf
746
747 To generate extended connectivity fingerprints corresponding to
748 neighborhood radius up to 3 using functional class atom types in vector
749 string format and create a SampleECFCFP.csv file containing sequential
750 compound IDs along with fingerprints vector strings data, type:
751
752 % ExtendedConnectivityFingerprints.pl -a FunctionalClassAtomTypes -n 3
753 -r SampleECFCFP -o Sample.sdf
754
755 To generate extended connectivity fingerprints corresponding to
756 neighborhood radius up to 2 using only AS,X atomic invariants atom types
757 in vector string format and create a SampleECAIFP.csv file containing
758 sequential compound IDs along with fingerprints vector strings data,
759 type:
760
761 % ExtendedConnectivityFingerprints.pl -a AtomicInvariantsAtomTypes
762 --AtomicInvariantsToUse "AS,X" -r SampleECAIFP -o Sample.sdf
763
764 To generate extended connectivity fingerprints corresponding to
765 neighborhood radius up to 2 using only HBD,HBA functional class atom
766 types in vector string format and create a SampleECFCFP.csv file
767 containing sequential compound IDs along with fingerprints vector
768 strings data, type:
769
770 % ExtendedConnectivityFingerprints.pl -a FunctionalClassAtomTypes
771 --FunctionalClassesToUse "HBD,HBA" -r SampleECFCFP -o Sample.sdf
772
773 To generate extended connectivity fingerprints corresponding to
774 neighborhood radius up to 2 using atomic invariants atom types in vector
775 string format and create a SampleECAIFP.csv file containing compound ID
776 from molecule name line along with fingerprints vector strings data,
777 type:
778
779 % ExtendedConnectivityFingerprints.pl -a AtomicInvariantsAtomTypes
780 --DataFieldsMode CompoundID --CompoundIDMode MolName
781 -r SampleECAIFP -o Sample.sdf
782
783 To generate extended connectivity fingerprints corresponding to
784 neighborhood radius up to 2 using functional class atom types in vector
785 string format and create a SampleECFCFP.csv file containing compound IDs
786 using specified data field along with fingerprints vector strings data,
787 type:
788
789 % ExtendedConnectivityFingerprints.pl -a FunctionalClassAtomTypes
790 --DataFieldsMode CompoundID --CompoundIDMode DataField --CompoundID Mol_ID
791 -r SampleECFCFP -o Sample.sdf
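
The command above relies on Sample.sdf carrying a Mol_ID data field. Before running it, it may be worth checking how many records actually have that field. The following is a minimal sketch using standard grep. The count_sd_field helper is hypothetical and not part of MayaChemTools; it assumes the usual SD file layout, where a data header line starts with ">" and gives the field name in angle brackets, and each record ends with a "$$$$" line.

```shell
#!/bin/sh
# Count records in an SD file that carry a given data field.
# count_sd_field is a hypothetical helper, not part of MayaChemTools.
count_sd_field() {
    field=$1
    sdfile=$2
    # Data header lines start with ">" and name the field in angle brackets.
    grep -c "^>.*<$field>" "$sdfile"
}

# Typical use: compare the field count with the record count.
#   count_sd_field Mol_ID Sample.sdf
#   grep -c '^\$\$\$\$' Sample.sdf
```

If the two counts differ, some records lack the field, and one of the other --CompoundIDMode values listed in the options may be a better fit.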
792
793 To generate extended connectivity fingerprints corresponding to
794 neighborhood radius up to 2 using atomic invariants atom types in vector
795 string format and create a SampleECAIFP.tsv file containing compound ID
796 using combination of molecule name line and an explicit compound prefix
797 along with fingerprints vector strings data, type:
798
799 % ExtendedConnectivityFingerprints.pl -a AtomicInvariantsAtomTypes
800 --DataFieldsMode CompoundID --CompoundIDMode MolNameOrLabelPrefix
801 --CompoundID Cmpd --CompoundIDLabel MolID --OutDelim tab -r SampleECAIFP -o Sample.sdf
802
803 To generate extended connectivity fingerprints corresponding to
804 neighborhood radius up to 2 using functional class atom types in vector
805 string format and create a SampleECFCFP.csv file containing specific
806 data fields columns along with fingerprints vector strings data, type:
807
808 % ExtendedConnectivityFingerprints.pl -a FunctionalClassAtomTypes
809 --DataFieldsMode Specify --DataFields Mol_ID -r SampleECFCFP
810 -o Sample.sdf
811
812 To generate extended connectivity fingerprints corresponding to
813 neighborhood radius up to 2 using atomic invariants atom types in vector
814 string format and create a SampleECAIFP.tsv file containing common data
815 fields columns along with fingerprints vector strings data, type:
816
817 % ExtendedConnectivityFingerprints.pl -a AtomicInvariantsAtomTypes
818 --DataFieldsMode Common --OutDelim tab -r SampleECAIFP -o Sample.sdf
819
820 To generate extended connectivity fingerprints corresponding to
821 neighborhood radius up to 2 using functional class atom types in vector
822 string format and create SampleECFCFP.sdf, SampleECFCFP.fpf and
823 SampleECFCFP.csv files containing all data fields columns in CSV file
824 along with fingerprints vector strings data, type:
825
826 % ExtendedConnectivityFingerprints.pl -a FunctionalClassAtomTypes
827 --DataFieldsMode All --output all -r SampleECFCFP
828 -o Sample.sdf
829
830 AUTHOR
831 Manish Sud <msud@san.rr.com>
832
833 SEE ALSO
834 InfoFingerprintsFiles.pl, SimilarityMatricesFingerprints.pl,
835 AtomNeighborhoodsFingerprints.pl, MACCSKeysFingerprints.pl,
836 PathLengthFingerprints.pl, TopologicalAtomPairsFingerprints.pl,
837 TopologicalAtomTorsionsFingerprints.pl,
838 TopologicalPharmacophoreAtomPairsFingerprints.pl,
839 TopologicalPharmacophoreAtomTripletsFingerprints.pl
840
841 COPYRIGHT
842 Copyright (C) 2015 Manish Sud. All rights reserved.
843
844 This file is part of MayaChemTools.
845
846 MayaChemTools is free software; you can redistribute it and/or modify it
847 under the terms of the GNU Lesser General Public License as published by
848 the Free Software Foundation; either version 3 of the License, or (at
849 your option) any later version.
850