mayatool3_test2: docs/scripts/txt/SimilaritySearchingFingerprints.txt comparison

comparison docs/scripts/txt/SimilaritySearchingFingerprints.txt @ 0:4816e4a8ae95 draft default tip

Uploaded

author	deepakjadmin
date	Wed, 20 Jan 2016 09:23:18 -0500
parents
children


--1:000000000000
+:4816e4a8ae95
+NAME
+SimilaritySearchingFingerprints.pl - Perform similarity search using
+fingerprints strings data in SD, FP and CSV/TSV text file(s)
+SYNOPSIS
+SimilaritySearchingFingerprints.pl ReferenceFPFile DatabaseFPFile
+SimilaritySearchingFingerprints.pl [--alpha *number*] [--beta *number*]
+[-b, --BitVectorComparisonMode *TanimotoSimilarity | TverskySimilarity |
+...*] [--DatabaseColMode *ColNum | ColLabel*] [--DatabaseCompoundIDCol
+*col number | col name*] [--DatabaseCompoundIDPrefix *text*]
+[--DatabaseCompoundIDField *DataFieldName*] [--DatabaseCompoundIDMode
+*DataField | MolName | LabelPrefix | MolNameOrLabelPrefix*]
+[--DatabaseDataCols *"DataColNum1, DataColNum2,... " | "DataColLabel1,
+DataColLabel2,... "*] [--DatabaseDataColsMode *All | Specify |
+CompoundID*] [--DatabaseDataFields *"FieldLabel1, FieldLabel2,... "*]
+[--DatabaseDataFieldsMode *All | Common | Specify | CompoundID*]
+[--DatabaseFingerprintsCol *col number | col name*]
+[--DatabaseFingerprintsField *FieldLabel*] [--DistanceCutoff *number*]
+[-d, --detail *InfoLevel*] [-f, --fast] [--FingerprintsMode *AutoDetect
+| FingerprintsBitVectorString | FingerprintsVectorString*] [-g,
+--GroupFusionRule *Max, Mean, Median, Min, Sum, Euclidean*]
+[--GroupFusionApplyCutoff *Yes | No*] [-h, --help] [--InDelim *comma |
+semicolon*] [-k, --KNN *all | number*] [-m, --mode *IndividualReference
+| MultipleReferences*] [-n, --NumOfSimilarMolecules *number*]
+[--OutDelim *comma | tab | semicolon*] [--output *SD | text | both*]
+[-o, --overwrite] [-p, --PercentSimilarMolecules *number*] [--precision
+*number*] [-q, --quote *Yes | No*] [--ReferenceColMode *ColNum |
+ColLabel*] [--ReferenceCompoundIDCol *col number | col name*]
+[--ReferenceCompoundIDPrefix *text*] [--ReferenceCompoundIDField
+*DataFieldName*] [--ReferenceCompoundIDMode *DataField | MolName |
+LabelPrefix | MolNameOrLabelPrefix*] [--ReferenceFingerprintsCol *col
+number | col name*] [--ReferenceFingerprintsField *FieldLabel*] [-r,
+--root *RootName*] [-s, --SearchMode *SimilaritySearch |
+DissimilaritySearch*] [--SimilarCountMode *NumOfSimilar |
+PercentSimilar*] [--SimilarityCutoff *number*] [-v,
+--VectorComparisonMode *TanimotoSimilarity | ... | ManhattanDistance |
+...*] [--VectorComparisonFormulism *AlgebraicForm | BinaryForm |
+SetTheoreticForm*] [-w, --WorkingDir dirname] ReferenceFingerprintsFile
+DatabaseFingerprintsFile
+DESCRIPTION
+Perform molecular similarity search [ Ref 94-113 ] using fingerprint
+bit-vector or vector strings data in *SD, FP, or CSV/TSV text* files
+corresponding to *ReferenceFingerprintsFile* and
+*DatabaseFingerprintsFile*, and generate SD and CSV/TSV text file(s)
+containing database molecules which are similar to reference
+molecule(s). The reference molecules are also referred to as query or
+seed molecules and database molecules as target molecules in the
+literature.
+The current release of MayaChemTools supports two types of similarity
+search modes: *IndividualReference or MultipleReferences*. For default
+value of *MultipleReferences* for -m, --mode option, reference molecules
+are considered as a set and -g, --GroupFusionRule is used to calculate
+similarity of a database molecule against reference molecules set. The
+group fusion rule is also referred to as data fusion or consensus
+scoring in the literature. However, for *IndividualReference* value of
+-m, --mode option, reference molecules are treated as individual
+molecules and each reference molecule is compared against a database
+molecule by itself to identify similar molecules.
+The molecular dissimilarity search can also be performed using
+*DissimilaritySearch* value for -s, --SearchMode option. During
+dissimilarity search or usage of distance comparison coefficient in
+similarity search, the meaning of fingerprints comparison
+value is automatically reversed as shown below:
+SearchMode     ComparisonCoefficient  ResultsSort   ComparisonValues
+Similarity     SimilarityCoefficient  Descending    Higher value implies high similarity
+Similarity     DistanceCoefficient    Ascending     Lower value implies high similarity
+Dissimilarity  SimilarityCoefficient  Ascending     Lower value implies high dissimilarity
+Dissimilarity  DistanceCoefficient    Descending    Higher value implies high dissimilarity
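The sort direction in the table above follows one rule: results are sorted in descending order exactly when the search mode and the coefficient kind "agree". The following is a minimal Python sketch of that rule, not code from MayaChemTools. The function names and the `"similarity"` / `"distance"` labels are hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def sort_descending(search_mode: str, coefficient_kind: str) -> bool:
    """Return True when results should be sorted in descending order.

    search_mode: "SimilaritySearch" or "DissimilaritySearch".
    coefficient_kind: "similarity" or "distance" (hypothetical labels).
    """
    is_similarity_search = search_mode == "SimilaritySearch"
    is_similarity_coefficient = coefficient_kind == "similarity"
    # Descending for (Similarity, SimilarityCoefficient) and
    # (Dissimilarity, DistanceCoefficient); ascending otherwise.
    return is_similarity_search == is_similarity_coefficient


def rank(results: List[Tuple[str, float]], search_mode: str,
         coefficient_kind: str) -> List[Tuple[str, float]]:
    """Order (compound ID, comparison value) pairs, best match first."""
    return sorted(results, key=lambda item: item[1],
                  reverse=sort_descending(search_mode, coefficient_kind))
```

With this ordering the first entry is always the best match for the chosen search mode, whichever kind of coefficient produced the values.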
+During *IndividualReference* value of -m, --Mode option for similarity
+search, fingerprints bit-vector or vector string of each reference
+molecule is compared with database molecules using specified similarity
+or distance coefficients to identify most similar molecules for each
+reference molecule. Based on value of --SimilarCountMode, up to -n,
+--NumOfSimilarMolecules or -p, --PercentSimilarMolecules at specified
+--SimilarityCutoff or --DistanceCutoff are identified for each reference
+molecule.
+During *MultipleReferences* value of -m, --mode option for similarity
+search, all reference molecules are considered as a set and -g,
+--GroupFusionRule is used to calculate similarity of a database molecule
+against reference molecules set either using all reference molecules or
+number of k-nearest neighbors (k-NN) to a database molecule specified
+using -k, --kNN. The fingerprints bit-vector or vector string of each
+reference molecule in a set is compared with a database molecule using a
+similarity or distance coefficient specified via -b,
+--BitVectorComparisonMode or -v, --VectorComparisonMode. The reference
+molecules whose comparison values with a database molecule fall outside
+specified --SimilarityCutoff or --DistanceCutoff are ignored during
+*Yes* value of --GroupFusionApplyCutoff. The specified -g,
+--GroupFusionRule is applied to -k, --kNN reference molecules to
+calculate final similarity value between a database molecule and
+reference molecules set.
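The k-NN group fusion step described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the script's implementation: it assumes the k nearest references are the ones with the best comparison values (highest for similarity coefficients, lowest for distance coefficients), it assumes the Mean rule divides by the number of references actually fused, and it covers only the Max, Min, and Mean rules. The name `fuse` and its parameters are hypothetical, and cutoff handling is omitted.

```python
from typing import List, Optional


def fuse(values: List[float], rule: str = "Max", k: Optional[int] = None,
         higher_is_closer: bool = True) -> float:
    """Fuse comparison values of one database molecule against a reference set.

    values: one comparison value per reference molecule.
    k: number of nearest references to keep; None keeps all of them.
    higher_is_closer: True for similarity coefficients, False for distances.
    """
    if not values:
        raise ValueError("at least one comparison value is required")
    # Put the nearest references first, then keep the first k of them.
    ordered = sorted(values, reverse=higher_is_closer)
    nearest = ordered if k is None else ordered[:k]
    if rule == "Max":
        return max(nearest)
    if rule == "Min":
        return min(nearest)
    if rule == "Mean":
        return sum(nearest) / len(nearest)
    raise ValueError(f"rule not covered by this sketch: {rule}")
```

For example, `fuse([0.2, 0.8, 0.5], "Mean", k=2)` averages only the two highest similarity values.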
+The input fingerprints *SD, FP, or Text (CSV/TSV)* files for
+*ReferenceFingerprintsFile* and *DatabaseFingerprintsFile* must contain valid
+fingerprint bit-vector or vector strings data corresponding to same type
+of fingerprints.
+The valid fingerprints *SDFile* extensions are *.sdf* and *.sd*. The
+valid fingerprints *FPFile* extensions are *.fpf* and *.fp*. The valid
+fingerprints *TextFile (CSV/TSV)* extensions are *.csv* and *.tsv* for
+comma/semicolon and tab delimited text files respectively. The --indelim
+option determines the format of *TextFile*. Any file which doesn't
+correspond to the format indicated by --indelim option is ignored.
+Example of *FP* file containing fingerprints bit-vector string data:
+#
+# Package = MayaChemTools 7.4
+# ReleaseDate = Oct 21, 2010
+#
+# TimeStamp =  Mon Mar 7 15:14:01 2011
+#
+# FingerprintsStringType = FingerprintsBitVector
+#
+# Description = PathLengthBits:AtomicInvariantsAtomTypes:MinLength1:...
+# Size = 1024
+# BitStringFormat = HexadecimalString
+# BitsOrder = Ascending
+#
+Cmpd1 9c8460989ec8a49913991a6603130b0a19e8051c89184414953800cc21510...
+Cmpd2 000000249400840040100042011001001980410c000000001010088001120...
+... ...
+... ...
+Example of *FP* file containing fingerprints vector string data:
+#
+# Package = MayaChemTools 7.4
+# ReleaseDate = Oct 21, 2010
+#
+# TimeStamp =  Mon Mar 7 15:14:01 2011
+#
+# FingerprintsStringType = FingerprintsVector
+#
+# Description = PathLengthBits:AtomicInvariantsAtomTypes:MinLength1:...
+# VectorStringFormat = IDsAndValuesString
+# VectorValuesType = NumericalValues
+#
+Cmpd1 338;C F N O C:C C:N C=O CC CF CN CO C:C:C C:C:N C:CC C:CF C:CN C:
+N:C C:NC CC:N CC=O CCC CCN CCO CNC NC=O O=CO C:C:C:C C:C:C:N C:C:CC...;
+33 1 2 5 21 2 2 12 1 3 3 20 2 10 2 2 1 2 2 2 8 2 5 1 1 1 19 2 8 2 2 2 2
+6 2 2 2 2 2 2 2 2 3 2 2 1 4 1 5 1 1 18 6 2 2 1 2 10 2 1 2 1 2 2 2 2 ...
+Cmpd2 103;C N O C=N C=O CC CN CO CC=O CCC CCN CCO CNC N=CN NC=O NCN O=C
+O C CC=O CCCC CCCN CCCO CCNC CNC=N CNC=O CNCN CCCC=O CCCCC CCCCN CC...;
+15 4 4 1 2 13 5 2 2 15 5 3 2 2 1 1 1 2 17 7 6 5 1 1 1 2 15 8 5 7 2 2 2 2
+1 2 1 1 3 15 7 6 8 3 4 4 3 2 2 1 2 3 14 2 4 7 4 4 4 4 1 1 1 2 1 1 1 ...
+... ...
+... ...
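The two *FP* examples above share one layout: `#` header lines holding `Key = Value` pairs, followed by data lines made of a compound ID and a fingerprints string. A minimal Python sketch of reading that layout is shown below. It assumes each record sits on a single physical line (the vector example above appears wrapped for display) and that the compound ID contains no spaces. The function name `parse_fp_text` is hypothetical.

```python
from typing import Dict, List, Tuple


def parse_fp_text(text: str) -> Tuple[Dict[str, str], List[Tuple[str, str]]]:
    """Split FP file content into header key/value pairs and data records."""
    header: Dict[str, str] = {}
    records: List[Tuple[str, str]] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Header lines look like "# Key = Value"; a bare "#" is a spacer.
            key, separator, value = line[1:].partition("=")
            if separator:
                header[key.strip()] = value.strip()
            continue
        # Data lines: compound ID, one space, then the fingerprints string.
        compound_id, _, fingerprints = line.partition(" ")
        records.append((compound_id, fingerprints.strip()))
    return header, records
```

The header values are kept as strings; converting `Size` to an integer or interpreting `BitsOrder` is left to the caller.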
+Example of *SD* file containing fingerprints bit-vector string data:
+... ...
+... ...
+$$$$
+... ...
+... ...
+... ...
+41 44  0  0  0  0  0  0  0  0999 V2000
+-3.3652    1.4499    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
+... ...
+2  3  1  0  0  0  0
+... ...
+M  END
+>  <CmpdID>
+Cmpd1
+>  <PathLengthFingerprints>
+FingerprintsBitVector;PathLengthBits:AtomicInvariantsAtomTypes:MinLengt
+h1:MaxLength8;1024;HexadecimalString;Ascending;9c8460989ec8a49913991a66
+03130b0a19e8051c89184414953800cc2151082844a201042800130860308e8204d4028
+00831048940e44281c00060449a5000ac80c894114e006321264401600846c050164462
+08190410805000304a10205b0100e04c0038ba0fad0209c0ca8b1200012268b61c0026a
+aa0660a11014a011d46
+$$$$
+... ...
+... ...
+Example of CSV *TextFile* containing fingerprints bit-vector string
+data:
+"CompoundID","PathLengthFingerprints"
+"Cmpd1","FingerprintsBitVector;PathLengthBits:AtomicInvariantsAtomTypes
+:MinLength1:MaxLength8;1024;HexadecimalString;Ascending;9c8460989ec8a4
+9913991a6603130b0a19e8051c89184414953800cc2151082844a20104280013086030
+8e8204d402800831048940e44281c00060449a5000ac80c894114e006321264401..."
+... ...
+... ...
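In the *SD* and CSV examples above, the whole fingerprint is stored as one semicolon-delimited string. The Python sketch below splits such a bit-vector string into its parts and counts the bits set to "1". The field names are assumptions inferred from the *FP* header keys (Description, Size, BitStringFormat, BitsOrder), and the names `BitVectorString`, `parse_bit_vector_string`, and `count_set_bits` are hypothetical. Counting set bits does not depend on the bit order, so `BitsOrder` is not interpreted here.

```python
from typing import NamedTuple


class BitVectorString(NamedTuple):
    kind: str
    description: str
    size: int
    bit_format: str
    bits_order: str
    bits: str


def parse_bit_vector_string(value: str) -> BitVectorString:
    """Split 'FingerprintsBitVector;Description;Size;Format;Order;Bits'."""
    fields = value.split(";")
    if len(fields) != 6 or fields[0] != "FingerprintsBitVector":
        raise ValueError("not a fingerprints bit-vector string")
    kind, description, size, bit_format, bits_order, bits = fields
    return BitVectorString(kind, description, int(size), bit_format,
                           bits_order, bits)


def count_set_bits(parsed: BitVectorString) -> int:
    """Count bits set to '1' for hexadecimal or binary bit strings."""
    if parsed.bit_format == "HexadecimalString":
        return bin(int(parsed.bits, 16)).count("1")
    if parsed.bit_format == "BinaryString":
        return parsed.bits.count("1")
    raise ValueError(f"format not covered by this sketch: {parsed.bit_format}")
```

A string truncated with `...`, as in the examples above, would fail the hexadecimal conversion; the sketch expects the complete value.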
+The current release of MayaChemTools supports the following types of
+fingerprint bit-vector and vector strings:
+FingerprintsVector;AtomNeighborhoods:AtomicInvariantsAtomTypes:MinRadi
+us0:MaxRadius2;41;AlphaNumericalValues;ValuesString;NR0-C.X1.BO1.H3-AT
+C1:NR1-C.X3.BO3.H1-ATC1:NR2-C.X1.BO1.H3-ATC1:NR2-C.X3.BO4-ATC1 NR0-C.X
+1.BO1.H3-ATC1:NR1-C.X3.BO3.H1-ATC1:NR2-C.X1.BO1.H3-ATC1:NR2-C.X3.BO4-A
+TC1 NR0-C.X2.BO2.H2-ATC1:NR1-C.X2.BO2.H2-ATC1:NR1-C.X3.BO3.H1-ATC1:NR2
+-C.X2.BO2.H2-ATC1:NR2-N.X3.BO3-ATC1:NR2-O.X1.BO1.H1-ATC1 NR0-C.X2.B...
+FingerprintsVector;AtomTypesCount:AtomicInvariantsAtomTypes:ArbitraryS
+ize;10;NumericalValues;IDsAndValuesString;C.X1.BO1.H3 C.X2.BO2.H2 C.X2
+.BO3.H1 C.X3.BO3.H1 C.X3.BO4 F.X1.BO1 N.X2.BO2.H1 N.X3.BO3 O.X1.BO1.H1
+O.X1.BO2;2 4 14 3 10 1 1 1 3 2
+FingerprintsVector;AtomTypesCount:SLogPAtomTypes:ArbitrarySize;16;Nume
+ricalValues;IDsAndValuesString;C1 C10 C11 C14 C18 C20 C21 C22 C5 CS F
+N11 N4 O10 O2 O9;5 1 1 1 14 4 2 1 2 2 1 1 1 1 3 1
+FingerprintsVector;AtomTypesCount:SLogPAtomTypes:FixedSize;67;OrderedN
+umericalValues;IDsAndValuesString;C1 C2 C3 C4 C5 C6 C7 C8 C9 C10 C11 C
+12 C13 C14 C15 C16 C17 C18 C19 C20 C21 C22 C23 C24 C25 C26 C27 CS N1 N
+2 N3 N4 N5 N6 N7 N8 N9 N10 N11 N12 N13 N14 NS O1 O2 O3 O4 O5 O6 O7 O8
+O9 O10 O11 O12 OS F Cl Br I Hal P S1 S2 S3 Me1 Me2;5 0 0 0 2 0 0 0 0 1
+1 0 0 1 0 0 0 14 0 4 2 1 0 0 0 0 0 2 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0...
+FingerprintsVector;EStateIndicies:ArbitrarySize;11;NumericalValues;IDs
+AndValuesString;SaaCH SaasC SaasN SdO SdssC SsCH3 SsF SsOH SssCH2 SssN
+H SsssCH;24.778 4.387 1.993 25.023 -1.435 3.975 14.006 29.759 -0.073 3
+.024 -2.270
+FingerprintsVector;EStateIndicies:FixedSize;87;OrderedNumericalValues;
+ValuesString;0 0 0 0 0 0 0 3.975 0 -0.073 0 0 24.778 -2.270 0 0 -1.435
+4.387 0 0 0 0 0 0 3.024 0 0 0 0 0 0 0 1.993 0 29.759 25.023 0 0 0 0 1
+4.006 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0
+FingerprintsVector;ExtendedConnectivity:AtomicInvariantsAtomTypes:Radi
+us2;60;AlphaNumericalValues;ValuesString;73555770 333564680 352413391
+666191900 1001270906 1371674323 1481469939 1977749791 2006158649 21414
+08799 49532520 64643108 79385615 96062769 273726379 564565671 85514103
+5 906706094 988546669 1018231313 1032696425 1197507444 1331250018 1338
+532734 1455473691 1607485225 1609687129 1631614296 1670251330 17303...
+FingerprintsVector;ExtendedConnectivityCount:AtomicInvariantsAtomTypes
+:Radius2;60;NumericalValues;IDsAndValuesString;73555770 333564680 3524
+13391 666191900 1001270906 1371674323 1481469939 1977749791 2006158649
+2141408799 49532520 64643108 79385615 96062769 273726379 564565671...;
+3 2 1 1 14 1 2 10 4 3 1 1 1 1 2 1 2 1 1 1 2 3 1 1 2 1 3 3 8 2 2 2 6 2
+1 2 1 1 2 1 1 1 2 1 1 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1
+FingerprintsBitVector;ExtendedConnectivityBits:AtomicInvariantsAtomTyp
+es:Radius2;1024;BinaryString;Ascending;0000000000000000000000000000100
+0000000001010000000110000011000000000000100000000000000000000000100001
+1000000110000000000000000000000000010011000000000000000000000000010000
+0000000000000000000000000010000000000000000001000000000000000000000000
+0000000000010000100001000000000000101000000000000000100000000000000...
+FingerprintsVector;ExtendedConnectivity:FunctionalClassAtomTypes:Radiu
+s2;57;AlphaNumericalValues;ValuesString;24769214 508787397 850393286 8
+62102353 981185303 1231636850 1649386610 1941540674 263599683 32920567
+1 571109041 639579325 683993318 723853089 810600886 885767127 90326012
+7 958841485 981022393 1126908698 1152248391 1317567065 1421489994 1455
+632544 1557272891 1826413669 1983319256 2015750777 2029559552 20404...
+FingerprintsVector;ExtendedConnectivity:EStateAtomTypes:Radius2;62;Alp
+haNumericalValues;ValuesString;25189973 528584866 662581668 671034184
+926543080 1347067490 1738510057 1759600920 2034425745 2097234755 21450
+44754 96779665 180364292 341712110 345278822 386540408 387387308 50430
+1706 617094135 771528807 957666640 997798220 1158349170 1291258082 134
+1138533 1395329837 1420277211 1479584608 1486476397 1487556246 1566...
+FingerprintsBitVector;MACCSKeyBits;166;BinaryString;Ascending;00000000
+0000000000000000000000000000000001001000010010000000010010000000011100
+0100101010111100011011000100110110000011011110100110111111111111011111
+11111111111110111000
+FingerprintsBitVector;MACCSKeyBits;322;BinaryString;Ascending;11101011
+1110011111100101111111000111101100110000000000000011100010000000000000
+0000000000000000000000000000000000000000000000101000000000000000000000
+0000000000000000000000000000000000000000000000000000000000000000000000
+0000000000000000000000000000000000000011000000000000000000000000000000
+0000000000000000000000000000000000000000
+FingerprintsVector;MACCSKeyCount;166;OrderedNumericalValues;ValuesStri
+ng;0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 1 0 0 3 0 0 0 0 4 0 0 2 0 0 0 0 0 0 0 0 2 0 0 2 0 0 0 0
+0 0 0 0 1 1 8 0 0 0 1 0 0 1 0 1 0 1 0 3 1 3 1 0 0 0 1 2 0 11 1 0 0 0
+5 0 0 1 2 0 1 1 0 0 0 0 0 1 1 0 1 1 1 1 0 4 0 0 1 1 0 4 6 1 1 1 2 1 1
+3 5 2 2 0 5 3 5 1 1 2 5 1 2 1 2 4 8 3 5 5 2 2 0 3 5 4 1
+FingerprintsVector;MACCSKeyCount;322;OrderedNumericalValues;ValuesStri
+ng;14 8 2 0 2 0 4 4 2 1 4 0 0 2 5 10 5 2 1 0 0 2 0 5 13 3 28 5 5 3 0 0
+0 4 2 1 1 0 1 1 0 0 2 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 22 5 3 0 0 0 1 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11 0 2 0 0 0 0 0 0 0 0 0
+0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 ...
+FingerprintsBitVector;PathLengthBits:AtomicInvariantsAtomTypes:MinLeng
+th1:MaxLength8;1024;BinaryString;Ascending;001000010011010101011000110
+0100010101011000101001011100110001000010001001101000001001001001001000
+0010110100000111001001000001001010100100100000000011000000101001011100
+0010000001000101010100000100111100110111011011011000000010110111001101
+0101100011000000010001000011000010100011101100001000001000100000000...
+FingerprintsVector;PathLengthCount:AtomicInvariantsAtomTypes:MinLength
+1:MaxLength8;432;NumericalValues;IDsAndValuesPairsString;C.X1.BO1.H3 2
+C.X2.BO2.H2 4 C.X2.BO3.H1 14 C.X3.BO3.H1 3 C.X3.BO4 10 F.X1.BO1 1 N.X
+2.BO2.H1 1 N.X3.BO3 1 O.X1.BO1.H1 3 O.X1.BO2 2 C.X1.BO1.H3C.X3.BO3.H1
+2 C.X2.BO2.H2C.X2.BO2.H2 1 C.X2.BO2.H2C.X3.BO3.H1 4 C.X2.BO2.H2C.X3.BO
+4 1 C.X2.BO2.H2N.X3.BO3 1 C.X2.BO3.H1:C.X2.BO3.H1 10 C.X2.BO3.H1:C....
+FingerprintsVector;PathLengthCount:MMFF94AtomTypes:MinLength1:MaxLengt
+h8;463;NumericalValues;IDsAndValuesPairsString;C5A 2 C5B 2 C=ON 1 CB 1
+8 COO 1 CR 9 F 1 N5 1 NC=O 1 O=CN 1 O=CO 1 OC=O 1 OR 2 C5A:C5B 2 C5A:N
+5 2 C5ACB 1 C5ACR 1 C5B:C5B 1 C5BC=ON 1 C5BCB 1 C=ON=O=CN 1 C=ONNC=O 1
+CB:CB 18 CBF 1 CBNC=O 1 COO=O=CO 1 COOCR 1 COOOC=O 1 CRCR 7 CRN5 1 CR
+OR 2 C5A:C5B:C5B 2 C5A:C5BC=ON 1 C5A:C5BCB 1 C5A:N5:C5A 1 C5A:N5CR ...
+FingerprintsVector;TopologicalAtomPairs:AtomicInvariantsAtomTypes:MinD
+istance1:MaxDistance10;223;NumericalValues;IDsAndValuesString;C.X1.BO1
+.H3-D1-C.X3.BO3.H1 C.X2.BO2.H2-D1-C.X2.BO2.H2 C.X2.BO2.H2-D1-C.X3.BO3.
+H1 C.X2.BO2.H2-D1-C.X3.BO4 C.X2.BO2.H2-D1-N.X3.BO3 C.X2.BO3.H1-D1-...;
+2 1 4 1 1 10 8 1 2 6 1 2 2 1 2 1 2 2 1 2 1 5 1 10 12 2 2 1 2 1 9 1 3 1
+1 1 2 2 1 3 6 1 6 14 2 2 2 3 1 3 1 8 2 2 1 3 2 6 1 2 2 5 1 3 1 23 1...
+FingerprintsVector;TopologicalAtomPairs:FunctionalClassAtomTypes:MinDi
+stance1:MaxDistance10;144;NumericalValues;IDsAndValuesString;Ar-D1-Ar
+Ar-D1-Ar.HBA Ar-D1-HBD Ar-D1-Hal Ar-D1-None Ar.HBA-D1-None HBA-D1-NI H
+BA-D1-None HBA.HBD-D1-NI HBA.HBD-D1-None HBD-D1-None NI-D1-None No...;
+23 2 1 1 2 1 1 1 1 2 1 1 7 28 3 1 3 2 8 2 1 1 1 5 1 5 24 3 3 4 2 13 4
+1 1 4 1 5 22 4 4 3 1 19 1 1 1 1 1 2 2 3 1 1 8 25 4 5 2 3 1 26 1 4 1 ...
+FingerprintsVector;TopologicalAtomTorsions:AtomicInvariantsAtomTypes;3
+3;NumericalValues;IDsAndValuesString;C.X1.BO1.H3-C.X3.BO3.H1-C.X3.BO4-
+C.X3.BO4 C.X1.BO1.H3-C.X3.BO3.H1-C.X3.BO4-N.X3.BO3 C.X2.BO2.H2-C.X2.BO
+2.H2-C.X3.BO3.H1-C.X2.BO2.H2 C.X2.BO2.H2-C.X2.BO2.H2-C.X3.BO3.H1-O...;
+2 2 1 1 2 2 1 1 3 4 4 8 4 2 2 6 2 2 1 2 1 1 2 1 1 2 6 2 4 2 1 3 1
+FingerprintsVector;TopologicalAtomTorsions:EStateAtomTypes;36;Numerica
+lValues;IDsAndValuesString;aaCH-aaCH-aaCH-aaCH aaCH-aaCH-aaCH-aasC aaC
+H-aaCH-aasC-aaCH aaCH-aaCH-aasC-aasC aaCH-aaCH-aasC-sF aaCH-aaCH-aasC-
+ssNH aaCH-aasC-aasC-aasC aaCH-aasC-aasC-aasN aaCH-aasC-ssNH-dssC a...;
+4 4 8 4 2 2 6 2 2 2 4 3 2 1 3 3 2 2 2 1 2 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2
+FingerprintsVector;TopologicalAtomTriplets:AtomicInvariantsAtomTypes:M
+inDistance1:MaxDistance10;3096;NumericalValues;IDsAndValuesString;C.X1
+.BO1.H3-D1-C.X1.BO1.H3-D1-C.X3.BO3.H1-D2 C.X1.BO1.H3-D1-C.X2.BO2.H2-D1
+0-C.X3.BO4-D9 C.X1.BO1.H3-D1-C.X2.BO2.H2-D3-N.X3.BO3-D4 C.X1.BO1.H3-D1
+-C.X2.BO2.H2-D4-C.X2.BO2.H2-D5 C.X1.BO1.H3-D1-C.X2.BO2.H2-D6-C.X3....;
+1 2 2 2 2 2 2 2 8 8 4 8 4 4 2 2 2 2 4 2 2 2 4 2 2 2 2 1 2 2 4 4 4 2 2
+2 4 4 4 8 4 4 2 4 4 4 2 4 4 2 2 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 8...
+FingerprintsVector;TopologicalAtomTriplets:SYBYLAtomTypes:MinDistance1
+:MaxDistance10;2332;NumericalValues;IDsAndValuesString;C.2-D1-C.2-D9-C
+.3-D10 C.2-D1-C.2-D9-C.ar-D10 C.2-D1-C.3-D1-C.3-D2 C.2-D1-C.3-D10-C.3-
+D9 C.2-D1-C.3-D2-C.3-D3 C.2-D1-C.3-D2-C.ar-D3 C.2-D1-C.3-D3-C.3-D4 C.2
+-D1-C.3-D3-N.ar-D4 C.2-D1-C.3-D3-O.3-D2 C.2-D1-C.3-D4-C.3-D5 C.2-D1-C.
+3-D5-C.3-D6 C.2-D1-C.3-D5-O.3-D4 C.2-D1-C.3-D6-C.3-D7 C.2-D1-C.3-D7...
+FingerprintsVector;TopologicalPharmacophoreAtomPairs:ArbitrarySize:Min
+Distance1:MaxDistance10;54;NumericalValues;IDsAndValuesString;H-D1-H H
+-D1-NI HBA-D1-NI HBD-D1-NI H-D2-H H-D2-HBA H-D2-HBD HBA-D2-HBA HBA-D2-
+HBD H-D3-H H-D3-HBA H-D3-HBD H-D3-NI HBA-D3-NI HBD-D3-NI H-D4-H H-D4-H
+BA H-D4-HBD HBA-D4-HBA HBA-D4-HBD HBD-D4-HBD H-D5-H H-D5-HBA H-D5-...;
+18 1 2 1 22 12 8 1 2 18 6 3 1 1 1 22 13 6 5 7 2 28 9 5 1 1 1 36 16 10
+3 4 1 37 10 8 1 35 10 9 3 3 1 28 7 7 4 18 16 12 5 1 2 1
+FingerprintsVector;TopologicalPharmacophoreAtomPairs:FixedSize:MinDist
+ance1:MaxDistance10;150;OrderedNumericalValues;ValuesString;18 0 0 1 0
+0 0 2 0 0 1 0 0 0 0 22 12 8 0 0 1 2 0 0 0 0 0 0 0 0 18 6 3 1 0 0 0 1
+0 0 1 0 0 0 0 22 13 6 0 0 5 7 0 0 2 0 0 0 0 0 28 9 5 1 0 0 0 1 0 0 1 0
+0 0 0 36 16 10 0 0 3 4 0 0 1 0 0 0 0 0 37 10 8 0 0 0 0 1 0 0 0 0 0 0
+0 35 10 9 0 0 3 3 0 0 1 0 0 0 0 0 28 7 7 4 0 0 0 0 0 0 0 0 0 0 0 18...
+FingerprintsVector;TopologicalPharmacophoreAtomTriplets:ArbitrarySize:
+MinDistance1:MaxDistance10;696;NumericalValues;IDsAndValuesString;Ar1-
+Ar1-Ar1 Ar1-Ar1-H1 Ar1-Ar1-HBA1 Ar1-Ar1-HBD1 Ar1-H1-H1 Ar1-H1-HBA1 Ar1
+-H1-HBD1 Ar1-HBA1-HBD1 H1-H1-H1 H1-H1-HBA1 H1-H1-HBD1 H1-HBA1-HBA1 H1-
+HBA1-HBD1 H1-HBA1-NI1 H1-HBD1-NI1 HBA1-HBA1-NI1 HBA1-HBD1-NI1 Ar1-...;
+46 106 8 3 83 11 4 1 21 5 3 1 2 2 1 1 1 100 101 18 11 145 132 26 14 23
+28 3 3 5 4 61 45 10 4 16 20 7 5 1 3 4 5 3 1 1 1 1 5 4 2 1 2 2 2 1 1 1
+119 123 24 15 185 202 41 25 22 17 3 5 85 95 18 11 23 17 3 1 1 6 4 ...
+FingerprintsVector;TopologicalPharmacophoreAtomTriplets:FixedSize:MinD
+istance1:MaxDistance10;2692;OrderedNumericalValues;ValuesString;46 106
+8 3 0 0 83 11 4 0 0 0 1 0 0 0 0 0 0 0 0 21 5 3 0 0 1 2 2 0 0 1 0 0 0
+0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 100 101 18 11 0 0 145 132 26
+14 0 0 23 28 3 3 0 0 5 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 61 45 10 4 0
+0 16 20 7 5 1 0 3 4 5 3 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 5 ...
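The vector strings listed above use three data layouts: `ValuesString` (values only), `IDsAndValuesString` (IDs, then values, separated by a semicolon), and `IDsAndValuesPairsString` (alternating ID and value). A minimal Python sketch of splitting them is given below. It assumes the leading fields are type, description, number of values, values type, and layout, as the examples suggest, and that IDs and values contain no spaces. The function name `parse_vector_string` and the dictionary keys are hypothetical.

```python
from typing import Dict, List, Optional, Union

Parsed = Dict[str, Union[str, int, Optional[List[str]], List[str]]]


def parse_vector_string(value: str) -> Parsed:
    """Split a FingerprintsVector string into IDs (if any) and values."""
    fields = value.split(";")
    if len(fields) < 6 or fields[0] != "FingerprintsVector":
        raise ValueError("not a fingerprints vector string")
    layout = fields[4]
    ids: Optional[List[str]]
    if layout == "ValuesString":
        ids, values = None, fields[5].split()
    elif layout == "IDsAndValuesString":
        if len(fields) < 7:
            raise ValueError("missing values after IDs")
        ids, values = fields[5].split(), fields[6].split()
    elif layout == "IDsAndValuesPairsString":
        tokens = fields[5].split()
        # Tokens alternate: ID, value, ID, value, ...
        ids, values = tokens[0::2], tokens[1::2]
    else:
        raise ValueError(f"layout not covered by this sketch: {layout}")
    return {"description": fields[1], "count": int(fields[2]),
            "values_type": fields[3], "ids": ids, "values": values}
```

Values are returned as strings because `AlphaNumericalValues` entries are not numbers; callers can convert `NumericalValues` entries with `float` when needed.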
+OPTIONS
+--alpha *number*
+Value of alpha parameter for calculating *Tversky* similarity
+coefficient specified for -b, --BitVectorComparisonMode option. It
+corresponds to weights assigned for bits set to "1" in a pair of
+fingerprint bit-vectors during the calculation of similarity
+coefficient. Possible values: *0 to 1*. Default value: <0.5>.
+--beta *number*
+Value of beta parameter for calculating *WeightedTanimoto* and
+*WeightedTversky* similarity coefficients specified for -b,
+--BitVectorComparisonMode option. It is used to weight the
+contributions of bits set to "0" during the calculation of
+similarity coefficients. Possible values: *0 to 1*. Default value of
+<1> makes *WeightedTanimoto* and *WeightedTversky* equivalent to
+*Tanimoto* and *Tversky*.
+-b, --BitVectorComparisonMode *TanimotoSimilarity | TverskySimilarity |
+...*
+Specify what similarity coefficient to use for calculating
+similarity between fingerprints bit-vector string data values in
+*ReferenceFingerprintsFile* and *DatabaseFingerprintsFile* during
+similarity search. Possible values: *TanimotoSimilarity |
+TverskySimilarity | ...*. Default: *TanimotoSimilarity*
+The current release supports the following similarity coefficients:
+*BaroniUrbaniSimilarity, BuserSimilarity, CosineSimilarity,
+DiceSimilarity, DennisSimilarity, ForbesSimilarity,
+FossumSimilarity, HamannSimilarity, JaccardSimilarity,
+Kulczynski1Similarity, Kulczynski2Similarity, MatchingSimilarity,
+McConnaugheySimilarity, OchiaiSimilarity, PearsonSimilarity,
+RogersTanimotoSimilarity, RussellRaoSimilarity, SimpsonSimilarity,
+SkoalSneath1Similarity, SkoalSneath2Similarity,
+SkoalSneath3Similarity, TanimotoSimilarity, TverskySimilarity,
+YuleSimilarity, WeightedTanimotoSimilarity,
+WeightedTverskySimilarity*. These similarity coefficients are
+described below.
+For two fingerprint bit-vectors A and B of same size, let:
+Na = Number of bits set to "1" in A
+Nb = Number of bits set to "1" in B
+Nc = Number of bits set to "1" in both A and B
+Nd = Number of bits set to "0" in both A and B
+Nt = Number of bits set to "1" or "0" in A or B (Size of A or B)
+Nt = Na + Nb - Nc + Nd
+Na - Nc = Number of bits set to "1" in A but not in B
+Nb - Nc = Number of bits set to "1" in B but not in A
+Then, various similarity coefficients [ Ref. 40 - 42 ] for a pair of
+bit-vectors A and B are defined as follows:
+*BaroniUrbaniSimilarity*: ( SQRT( Nc * Nd ) + Nc ) / ( SQRT ( Nc *
+Nd ) + Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as Buser )
+*BuserSimilarity*: ( SQRT ( Nc * Nd ) + Nc ) / ( SQRT ( Nc * Nd ) +
+Nc + ( Na - Nc ) + ( Nb - Nc ) ) ( same as BaroniUrbani )
+*CosineSimilarity*: Nc / SQRT ( Na * Nb ) (same as Ochiai)
+*DiceSimilarity*: (2 * Nc) / ( Na + Nb )
+*DennisSimilarity*: ( Nc * Nd - ( ( Na - Nc ) * ( Nb - Nc ) ) ) /
+SQRT ( Nt * Na * Nb)
+*ForbesSimilarity*: ( Nt * Nc ) / ( Na * Nb )
+*FossumSimilarity*: ( Nt * ( ( Nc - 1/2 ) ** 2 ) ) / ( Na * Nb )
+*HamannSimilarity*: ( ( Nc + Nd ) - ( Na - Nc ) - ( Nb - Nc ) ) / Nt
+*JaccardSimilarity*: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc / (
+Na + Nb - Nc ) (same as Tanimoto)
+*Kulczynski1Similarity*: Nc / ( ( Na - Nc ) + ( Nb - Nc) ) = Nc / (
+Na + Nb - 2Nc )
+*Kulczynski2Similarity*: ( ( Nc / 2 ) * ( 2 * Nc + ( Na - Nc ) + (
+Nb - Nc) ) ) / ( ( Nc + ( Na - Nc ) ) * ( Nc + ( Nb - Nc ) ) ) = 0.5
+* ( Nc / Na + Nc / Nb )
+*MatchingSimilarity*: ( Nc + Nd ) / Nt
+*McConnaugheySimilarity*: ( Nc ** 2 - ( Na - Nc ) * ( Nb - Nc) ) / (
+Na * Nb )
+*OchiaiSimilarity*: Nc / SQRT ( Na * Nb ) (same as Cosine)
+*PearsonSimilarity*: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) /
+SQRT ( Na * Nb * ( Na - Nc + Nd ) * ( Nb - Nc + Nd ) )
+*RogersTanimotoSimilarity*: ( Nc + Nd ) / ( ( Na - Nc) + ( Nb - Nc)
++ Nt) = ( Nc + Nd ) / ( Na + Nb - 2Nc + Nt)
+*RussellRaoSimilarity*: Nc / Nt
+*SimpsonSimilarity*: Nc / MIN ( Na, Nb)
+*SkoalSneath1Similarity*: Nc / ( Nc + 2 * ( Na - Nc) + 2 * ( Nb -
+Nc) ) = Nc / ( 2 * Na + 2 * Nb - 3 * Nc )
+*SkoalSneath2Similarity*: ( 2 * Nc + 2 * Nd ) / ( Nc + Nd + Nt )
+*SkoalSneath3Similarity*: ( Nc + Nd ) / ( ( Na - Nc ) + ( Nb - Nc )
+) = ( Nc + Nd ) / ( Na + Nb - 2 * Nc )
+*TanimotoSimilarity*: Nc / ( ( Na - Nc) + ( Nb - Nc ) + Nc ) = Nc /
+( Na + Nb - Nc ) (same as Jaccard)
+*TverskySimilarity*: Nc / ( alpha * ( Na - Nc ) + ( 1 - alpha) * (
+Nb - Nc) + Nc ) = Nc / ( alpha * ( Na - Nb ) + Nb)
+*YuleSimilarity*: ( ( Nc * Nd ) - ( ( Na - Nc ) * ( Nb - Nc ) ) ) /
+( ( Nc * Nd ) + ( ( Na - Nc ) * ( Nb - Nc ) ) )
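A few of the coefficients above can be worked through in Python as follows. This is a minimal sketch, not the MayaChemTools implementation: it takes two equal-length strings of "0" and "1" characters, derives Na, Nb, Nc, Nd, and Nt as defined above, and evaluates the Tanimoto, Dice, Cosine, and Tversky formulas. The function names are hypothetical, and all-zero bit-vectors (which make the denominators zero) are not handled.

```python
import math
from typing import Tuple


def bit_counts(a: str, b: str) -> Tuple[int, int, int, int, int]:
    """Return (Na, Nb, Nc, Nd, Nt) for two equal-length '0'/'1' strings."""
    if len(a) != len(b):
        raise ValueError("bit-vectors must have the same size")
    na = a.count("1")
    nb = b.count("1")
    nc = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    nd = sum(1 for x, y in zip(a, b) if x == "0" and y == "0")
    return na, nb, nc, nd, len(a)


def tanimoto(na: int, nb: int, nc: int) -> float:
    return nc / (na + nb - nc)


def dice(na: int, nb: int, nc: int) -> float:
    return (2 * nc) / (na + nb)


def cosine(na: int, nb: int, nc: int) -> float:
    return nc / math.sqrt(na * nb)


def tversky(na: int, nb: int, nc: int, alpha: float = 0.5) -> float:
    # Simplified form from the text: Nc / ( alpha * ( Na - Nb ) + Nb )
    return nc / (alpha * (na - nb) + nb)
```

For A = 11001010 and B = 10011000, the counts are Na = 4, Nb = 3, Nc = 2, Nd = 3, and Nt = 8, which satisfies Nt = Na + Nb - Nc + Nd, and the Tanimoto value is 2 / 5 = 0.4. With alpha = 0.5 the Tversky formula reduces to the Dice formula.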
+Values of Tanimoto/Jaccard and Tversky coefficients are dependent on
+only those bits which are set to "1" in A or B. In order to
+take into account all bit positions, modified versions of Tanimoto [
+Ref. 42 ] and Tversky [ Ref. 43 ] have been developed.
+Let:
+Na' = Number of bits set to "0" in A
+Nb' = Number of bits set to "0" in B
+Nc' = Number of bits set to "0" in both A and B
+Tanimoto': Nc' / ( ( Na' - Nc') + ( Nb' - Nc' ) + Nc' ) = Nc' / (
+Na' + Nb' - Nc' )
+Tversky': Nc' / ( alpha * ( Na' - Nc' ) + ( 1 - alpha) * ( Nb' - Nc'
+) + Nc' ) = Nc' / ( alpha * ( Na' - Nb' ) + Nb')
+Then:
+*WeightedTanimotoSimilarity* = beta * Tanimoto + (1 - beta) *
+Tanimoto'
+*WeightedTverskySimilarity* = beta * Tversky + (1 - beta) * Tversky'
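The weighted Tanimoto definition above can be sketched in Python as follows, again on equal-length strings of "0" and "1" characters. The sketch is illustrative only. Returning 0.0 when a denominator is zero (for example, when neither bit-vector has any bit set to "0") is an assumption made here to keep the function total, not documented behavior. The function names are hypothetical.

```python
def _tanimoto_for(a: str, b: str, bit: str) -> float:
    """Tanimoto computed over positions holding the given bit ('1' or '0')."""
    n_a = a.count(bit)
    n_b = b.count(bit)
    n_c = sum(1 for x, y in zip(a, b) if x == bit and y == bit)
    denominator = n_a + n_b - n_c
    # Assumption of this sketch: report 0.0 when the coefficient is undefined.
    return n_c / denominator if denominator else 0.0


def weighted_tanimoto(a: str, b: str, beta: float = 1.0) -> float:
    """beta * Tanimoto + (1 - beta) * Tanimoto'."""
    if len(a) != len(b):
        raise ValueError("bit-vectors must have the same size")
    ones = _tanimoto_for(a, b, "1")   # Tanimoto on bits set to "1"
    zeros = _tanimoto_for(a, b, "0")  # Tanimoto' on bits set to "0"
    return beta * ones + (1 - beta) * zeros
```

For A = 11001010 and B = 10011000, Tanimoto is 0.4 and Tanimoto' is 3 / ( 4 + 5 - 3 ) = 0.5, so a beta of 1 gives 0.4, a beta of 0 gives 0.5, and a beta of 0.5 gives 0.45.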
+--DatabaseColMode *ColNum | ColLabel*
+Specify how columns are identified in database fingerprints
+*TextFile*: using column number or column label. Possible values:
+*ColNum or ColLabel*. Default value: *ColNum*.
+--DatabaseCompoundIDCol *col number | col name*
+This value is --DatabaseColMode mode specific. It specifies column
+to use for retrieving compound ID from database fingerprints
+*TextFile* during similarity and dissimilarity search for output SD
+and CSV/TSV text files. Possible values: *col number or col label*.
+Default value: *first column containing the word compoundID in its
+column label or sequentially generated IDs*.
+This is only used for *CompoundID* value of --DatabaseDataColsMode
+option.
+--DatabaseCompoundIDPrefix *text*
+Specify compound ID prefix to use during sequential generation of
+compound IDs for database fingerprints *SDFile* and *TextFile*.
+Default value: *Cmpd*. The default value generates compound IDs
+which look like Cmpd<Number>.
+For database fingerprints *SDFile*, this value is only used during
+*LabelPrefix | MolNameOrLabelPrefix* values of
+--DatabaseCompoundIDMode option; otherwise, it's ignored.
+Examples for *LabelPrefix* or *MolNameOrLabelPrefix* value of
+--DatabaseCompoundIDMode:
+Compound
+The value specified above generates compound IDs which correspond
+to Compound<Number> instead of default value of Cmpd<Number>.
+--DatabaseCompoundIDField *DataFieldName*
+Specify database fingerprints *SDFile* datafield label for
+generating compound IDs. This value is only used during *DataField*
+value of --DatabaseCompoundIDMode option.
+Examples for *DataField* value of --DatabaseCompoundIDMode:
+MolID
+ExtReg
+--DatabaseCompoundIDMode *DataField | MolName | LabelPrefix |
+MolNameOrLabelPrefix*
+Specify how to generate compound IDs from database fingerprints
+*SDFile* during similarity and dissimilarity search for output SD
+and CSV/TSV text files: use a *SDFile* datafield value; use molname
+line from *SDFile*; generate a sequential ID with specific prefix;
+use combination of both MolName and LabelPrefix with usage of
+LabelPrefix values for empty molname lines.
+Possible values: *DataField | MolName | LabelPrefix |
+MolNameOrLabelPrefix*. Default: *LabelPrefix*.
+For *MolNameOrLabelPrefix* value of --DatabaseCompoundIDMode,
+molname line in *SDFile* takes precedence over sequential compound
+IDs generated using *LabelPrefix* and only empty molname values are
+replaced with sequential compound IDs.
+This is only used for *CompoundID* value of --DatabaseDataFieldsMode
+option.
+--DatabaseDataCols *"DataColNum1,DataColNum2,... " |
+"DataColLabel1,DataColLabel2,... "*
+This value is --DatabaseColMode mode specific. It is a comma
+delimited list of database fingerprints *TextFile* data column
+numbers or labels to extract and write to SD and CSV/TSV text files
+along with other information for *SD | text | both* values of
+--output option.
+This is only used for *Specify* value of --DatabaseDataColsMode
+option.
+Examples:
+1,2,3
+CompoundName,MolWt
+--DatabaseDataColsMode *All | Specify | CompoundID*
+Specify how data columns from database fingerprints *TextFile* are
+transferred to output SD and CSV/TSV text files along with other
+information for *SD | text | both* values of --output option:
+transfer all data columns; extract specified data columns; generate
+a compound ID using database compound prefix. Possible values: *All |
+Specify | CompoundID*. Default value: *CompoundID*.
+--DatabaseDataFields *"FieldLabel1,FieldLabel2,... "*
+Comma delimited list of database fingerprints *SDFile* data fields
+to extract and write to SD and CSV/TSV text files along with other
+information for *SD | text | both* values of --output option.
+This is only used for *Specify* value of --DatabaseDataFieldsMode
+option.
+Examples:
+Extreg
+MolID,CompoundName
+--DatabaseDataFieldsMode *All | Common | Specify | CompoundID*
+Specify how data fields from database fingerprints *SDFile* are
+transferred to output SD and CSV/TSV text files along with other
+information for *SD | text | both* values of --output option:
+transfer all SD data fields; transfer SD data fields common to all
+compounds; extract specified data fields; generate a compound ID
+using molname line, a compound prefix, or a combination of both.
+Possible values: *All | Common | Specify | CompoundID*. Default
+value: *CompoundID*.
+--DatabaseFingerprintsCol *col number | col name*
+This value is --DatabaseColMode specific. It specifies fingerprints
+column to use during similarity and dissimilarity search for
+database fingerprints *TextFile*. Possible values: *col number or
+col label*. Default value: *first column containing the word
+Fingerprints in its column label*.
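The default described above, picking the first column whose label contains the word Fingerprints, can be sketched in Python as follows. Matching case-insensitively is an assumption of this sketch, as is the function name `find_column`; the header row is taken from the CSV example earlier in this document.

```python
import csv
import io
from typing import List, Optional


def find_column(labels: List[str], word: str) -> Optional[int]:
    """Return the index of the first label containing word, or None."""
    for index, label in enumerate(labels):
        # Assumption: the match ignores upper/lower case.
        if word.lower() in label.lower():
            return index
    return None


# Header row from the CSV TextFile example.
header_line = '"CompoundID","PathLengthFingerprints"\n'
labels = next(csv.reader(io.StringIO(header_line)))
fingerprints_column = find_column(labels, "Fingerprints")
compound_id_column = find_column(labels, "compoundID")
```

The returned index is zero-based, whereas *ColNum* values on the command line are column numbers, so a caller would need to convert between the two.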
+--DatabaseFingerprintsField *FieldLabel*
+Fingerprints field label to use during similarity and dissimilarity
+search for database fingerprints *SDFile*. Default value: *first
+data field label containing the word Fingerprints in its label*
+--DistanceCutoff *number*
+Distance cutoff value to use during comparison of distance value
+between a pair of database and reference molecule calculated by
+distance comparison methods for fingerprints vector string data
+values. Possible values: *Any valid number*. Default value: *10*.
+The comparison value between a pair of database and reference
+molecule must meet the cutoff criterion as shown below:
+SearchMode     CutoffCriterion  ComparisonValues
+Similarity     <=               Lower value implies high similarity
+Dissimilarity  >=               Higher value implies high dissimilarity
+This option is only used during distance coefficients values of -v,
+--VectorComparisonMode option.
+This option is ignored during *No* value of --GroupFusionApplyCutoff
+for *MultipleReferences* -m, --mode.
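The cutoff criterion for distance coefficients can be expressed as a small Python predicate. This is a minimal sketch of the table above. The function name `passes_distance_cutoff` is hypothetical, and the default of 10 mirrors the documented default value of --DistanceCutoff.

```python
def passes_distance_cutoff(distance: float, search_mode: str,
                           cutoff: float = 10.0) -> bool:
    """Apply the --DistanceCutoff criterion for the given search mode."""
    if search_mode == "SimilaritySearch":
        # Lower distance implies higher similarity.
        return distance <= cutoff
    if search_mode == "DissimilaritySearch":
        # Higher distance implies higher dissimilarity.
        return distance >= cutoff
    raise ValueError(f"unknown search mode: {search_mode}")
```

A distance exactly equal to the cutoff passes in both search modes because both criteria are inclusive.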
+-d, --detail *InfoLevel*
+Level of information to print about lines being ignored. Default:
+*1*. Possible values: *1, 2 or 3*.
+-f, --fast
+In this mode, fingerprints columns specified using
+--ReferenceFingerprintsCol and --DatabaseFingerprintsCol for reference
+and database fingerprints *TextFile(s)*, and fields specified using
+--ReferenceFingerprintsField and --DatabaseFingerprintsField for
+reference and database fingerprints *SDFile(s)* are assumed to
+contain valid fingerprints data and no
+checking is performed before performing similarity and dissimilarity
+search. By default, fingerprints data is validated before computing
+pairwise similarity and distance coefficients.
+--FingerprintsMode *AutoDetect | FingerprintsBitVectorString |
+FingerprintsVectorString*
+Format of fingerprint strings data in reference and database
+fingerprints *SD, FP, or Text (CSV/TSV)* files: automatically detect
+format of fingerprints string created by MayaChemTools fingerprints
+generation scripts or explicitly specify its format. Possible
+values: *AutoDetect | FingerprintsBitVectorString |
+FingerprintsVectorString*. Default value: *AutoDetect*.
+-g, --GroupFusionRule *Max, Min, Mean, Median, Sum, Euclidean*
+Specify what group fusion [ Ref 94-97, Ref 100, Ref 105 ] rule to
+use for calculating similarity of a database molecule against a set
+of reference molecules during *MultipleReferences* value of
+similarity search -m, --mode. Possible values: *Max, Min, Mean,
+Median, Sum, Euclidean*. Default value: *Max*. *Mean* value
+corresponds to average or arithmetic mean. The group fusion rule is
+also referred to as data fusion or consensus scoring in the
+literature.
+For a reference molecules set and a database molecule, let:
+N = Number of reference molecules in a set
+i = ith reference molecule in a set
+n = Nth reference molecule in a set
+d = dth database molecule
+Cid = Fingerprints comparison value between ith reference and dth database
+molecule - similarity/dissimilarity comparison using similarity or
+distance coefficient
+Then, various group fusion rules to calculate fused similarity
+between a database molecule and reference molecules set are defined
+as follows:
+Max: MAX ( C1d, C2d, ..., Cid, ..., Cnd )
+Min: MIN ( C1d, C2d, ..., Cid, ..., Cnd )
+Mean: SUM ( C1d, C2d, ..., Cid, ..., Cnd ) / N
+Median: MEDIAN ( C1d, C2d, ..., Cid, ..., Cnd )
+Sum: SUM ( C1d, C2d, ..., Cid, ..., Cnd )
+Euclidean: SQRT( SUM( C1d ** 2, C2d ** 2, ..., Cid ** 2, ..., Cnd
+** 2) )
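The fusion rules above can be sketched in a few lines of Python. This is an illustrative sketch, not code from MayaChemTools: the function name `fuse` is hypothetical, and each rule is applied literally to the list of comparison values, without the automatic reversal of *Max* and *Min* that the script applies for distance coefficients and dissimilarity search.

```python
import math
import statistics


def fuse(values, rule="Max"):
    """Fuse comparison values C1d..Cnd of one database molecule (hypothetical helper)."""
    if not values:
        raise ValueError("at least one comparison value is required")
    if rule == "Max":
        return max(values)
    if rule == "Min":
        return min(values)
    if rule == "Mean":
        return sum(values) / len(values)
    if rule == "Median":
        return statistics.median(values)
    if rule == "Sum":
        return sum(values)
    if rule == "Euclidean":
        # Square root of the sum of squared comparison values.
        return math.sqrt(sum(v * v for v in values))
    raise ValueError("unknown group fusion rule: " + rule)


comparison_values = [0.9, 0.5, 0.7]
for rule in ("Max", "Min", "Mean", "Median", "Sum", "Euclidean"):
    print(rule, fuse(comparison_values, rule))
```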
+The fingerprints bit-vector or vector string of each reference
+molecule in a set is compared with a database molecule using a
+similarity or distance coefficient specified via -b,
+--BitVectorComparisonMode or -v, --VectorComparisonMode. The
+reference molecules whose comparison values with a database molecule
+fall outside specified --SimilarityCutoff or --DistanceCutoff are
+ignored during *Yes* value of --GroupFusionApplyCutoff. The
+specified -g, --GroupFusionRule is applied to -k, --kNN reference
+molecules to calculate final fused similarity value between a
+database molecule and reference molecules set.
+During dissimilarity search or usage of distance comparison
+coefficient in similarity search, the meaning of fingerprints
+comparison value is automatically reversed as shown below:
+SearchMode     ComparisonCoefficient  ComparisonValues
+Similarity     SimilarityCoefficient  Higher value implies high similarity
+Similarity     DistanceCoefficient    Lower value implies high similarity
+Dissimilarity  SimilarityCoefficient  Lower value implies high dissimilarity
+Dissimilarity  DistanceCoefficient    Higher value implies high dissimilarity
+Consequently, *Max* implies highest and lowest comparison value for
+usage of similarity and distance coefficient respectively during
+similarity search. And it corresponds to lowest and highest
+comparison value for usage of similarity and distance coefficient
+respectively during dissimilarity search. During *Min* fusion rule,
+the highest and lowest comparison values are appropriately reversed.
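The reversal described above can be made concrete with a short Python sketch. The names `higher_is_better` and `fuse_max_min`, and the string labels `Similarity` and `Distance` for the coefficient type, are assumptions made for illustration; they are not taken from the script.

```python
def higher_is_better(coefficient_kind, search_mode):
    """Return True when a higher comparison value is the better match."""
    if coefficient_kind not in ("Similarity", "Distance"):
        raise ValueError("unknown coefficient kind: " + coefficient_kind)
    if search_mode not in ("SimilaritySearch", "DissimilaritySearch"):
        raise ValueError("unknown search mode: " + search_mode)
    return (coefficient_kind == "Similarity") == (search_mode == "SimilaritySearch")


def fuse_max_min(values, rule, coefficient_kind, search_mode):
    """Apply the Max or Min rule with the automatic reversal of meaning."""
    if rule not in ("Max", "Min"):
        raise ValueError("only Max and Min are handled in this sketch")
    pick_highest = higher_is_better(coefficient_kind, search_mode) == (rule == "Max")
    return max(values) if pick_highest else min(values)


values = [0.2, 0.9, 0.5]
print(fuse_max_min(values, "Max", "Similarity", "SimilaritySearch"))
print(fuse_max_min(values, "Max", "Distance", "SimilaritySearch"))
```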
+--GroupFusionApplyCutoff *Yes | No*
+Specify whether to apply --SimilarityCutoff or --DistanceCutoff
+values during application of -g, --GroupFusionRule to reference
+molecules set. Possible values: *Yes or No*. Default value: *Yes*.
+During *Yes* value of --GroupFusionApplyCutoff, the reference
+molecules whose comparison values with a database molecule fall
+outside specified --SimilarityCutoff or --DistanceCutoff are not
+used to calculate final fused similarity value between a database
+molecule and reference molecules set.
+-h, --help
+Print this help message.
+--InDelim *comma | semicolon*
+Input delimiter for reference and database fingerprints CSV
+*TextFile(s)*. Possible values: *comma or semicolon*. Default value:
+*comma*. For TSV files, this option is ignored and *tab* is used as
+a delimiter.
+-k, --kNN *all | number*
+Number of k-nearest neighbors (k-NN) reference molecules to use
+during -g, --GroupFusionRule for calculating similarity of a
+database molecule against a set of reference molecules. Possible
+values: *all | positive integers*. Default: *all*.
+After ranking similarity values between a database molecule and
+reference molecules during *MultipleReferences* value of similarity
+search -m, --mode option, the top -k, --kNN reference molecules are
+selected and used during -g, --GroupFusionRule.
+This option is -s, --SearchMode dependent: It corresponds to
+dissimilar molecules during *DissimilaritySearch* value of -s,
+--SearchMode option.
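A minimal Python sketch of the k-NN selection step, assuming the comparison values between one database molecule and every reference molecule are already available as a list. The function name `top_k_references` and the `higher_is_better` flag are hypothetical; the flag stands in for the combination of coefficient type and search mode.

```python
def top_k_references(values, k="all", higher_is_better=True):
    """Rank comparison values best-first and keep the top k of them."""
    ranked = sorted(values, reverse=higher_is_better)
    if k == "all":
        return ranked
    if not isinstance(k, int) or k < 1:
        raise ValueError("k must be 'all' or a positive integer")
    return ranked[:k]


# Similarity coefficient: the two most similar reference molecules.
print(top_k_references([0.4, 0.9, 0.7, 0.8], k=2))
# Distance coefficient: lower values rank first.
print(top_k_references([3.0, 1.0, 2.0], k=2, higher_is_better=False))
```

The selected values would then be passed to the chosen group fusion rule.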
+-m, --mode *IndividualReference | MultipleReferences*
+Specify how to treat reference molecules in
+*ReferenceFingerprintsFile* during similarity search: Treat each
+reference molecule individually during similarity search or perform
+similarity search by treating multiple reference molecules as a set.
+Possible values: *IndividualReference | MultipleReferences*. Default
+value: *MultipleReferences*.
+During *IndividualReference* value of -m, --Mode for similarity
+search, fingerprints bit-vector or vector string of each reference
+molecule is compared with database molecules using specified
+similarity or distance coefficients to identify most similar
+molecules for each reference molecule. Based on value of
+--SimilarCountMode, up to -n, --NumOfSimilarMolecules or -p,
+--PercentSimilarMolecules at specified --SimilarityCutoff or
+--DistanceCutoff are identified for each reference molecule.
+During *MultipleReferences* value of -m, --mode for similarity search,
+all reference molecules are considered as a set and -g,
+--GroupFusionRule is used to calculate similarity of a database
+molecule against reference molecules set either using all reference
+molecules or number of k-nearest neighbors (k-NN) to a database
+molecule specified using -k, --kNN. The fingerprints bit-vector or
+vector string of each reference molecule in a set is compared with a
+database molecule using a similarity or distance coefficient
+specified via -b, --BitVectorComparisonMode or -v,
+--VectorComparisonMode. The reference molecules whose comparison
+values with a database molecule fall outside specified
+--SimilarityCutoff or --DistanceCutoff are ignored. The specified
+-g, --GroupFusionRule is applied to rest of -k, --kNN reference
+molecules to calculate final similarity value between a database
+molecule and reference molecules set.
+The meaning of similarity and distance is automatically reversed
+during *DissimilaritySearch* value of -s, --SearchMode along with
+appropriate handling of --SimilarityCutoff or --DistanceCutoff
+values.
+-n, --NumOfSimilarMolecules *number*
+Maximum number of most similar database molecules to find for each
+reference molecule or set of reference molecules based on
+*IndividualReference* or *MultipleReferences* value of similarity
+search -m, --mode option. Default: *10*. Valid values: positive
+integers.
+This option is ignored during *PercentSimilar* value of
+--SimilarCountMode option.
+This option is -s, --SearchMode dependent: It corresponds to
+dissimilar molecules during *DissimilaritySearch* value of -s,
+--SearchMode option.
+--OutDelim *comma | tab | semicolon*
+Delimiter for output CSV/TSV text file. Possible values: *comma,
+tab, or semicolon*. Default value: *comma*.
+--output *SD | text | both*
+Type of output files to generate. Possible values: *SD, text, or
+both*. Default value: *text*.
+-o, --overwrite
+Overwrite existing files.
+-p, --PercentSimilarMolecules *number*
+Maximum percent of most similar database molecules to find for each
+reference molecule or set of reference molecules based on
+*IndividualReference* or *MultipleReferences* value of similarity
+search -m, --mode option. Default: *1* percent of database
+molecules. Valid values: non-zero values in between *0 to 100*.
+This option is ignored during *NumOfSimilar* value of
+--SimilarCountMode option.
+During *PercentSimilar* value of --SimilarCountMode option, the
+number of molecules in *DatabaseFingerprintsFile* is counted and
+number of similar molecules corresponds to --PercentSimilarMolecules
+of the total number of database molecules.
+This option is -s, --SearchMode dependent: It corresponds to
+dissimilar molecules during *DissimilaritySearch* value of -s,
+--SearchMode option.
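The percent-based count can be illustrated with a small Python sketch. The text above does not state how a fractional result is rounded, so rounding down is an assumption of this sketch, and the function name `num_from_percent` is hypothetical.

```python
def num_from_percent(total_database_molecules, percent=1.0):
    """Convert a percent of the database into a molecule count.

    Assumption: fractional counts are rounded down; the script's own
    rounding behavior is not documented here.
    """
    if not 0 < percent <= 100:
        raise ValueError("percent must be greater than 0 and at most 100")
    return int(total_database_molecules * percent / 100)


print(num_from_percent(5000, 2))
print(num_from_percent(250, 1))
```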
+--precision *number*
+Precision of calculated similarity values for comparison and
+generating output files. Default: up to *2* decimal places. Valid
+values: positive integers.
+-q, --quote *Yes | No*
+Put quote around column values in output CSV/TSV text file. Possible
+values: *Yes or No*. Default value: *Yes*.
+--ReferenceColMode *ColNum | ColLabel*
+Specify how columns are identified in reference fingerprints
+*TextFile*: using column number or column label. Possible values:
+*ColNum or ColLabel*. Default value: *ColNum*.
+--ReferenceCompoundIDCol *col number | col name*
+This value is --ReferenceColMode mode specific. It specifies column
+to use for retrieving compound ID from reference fingerprints
+*TextFile* during similarity and dissimilarity search for output SD
+and CSV/TSV text files. Possible values: *col number or col label*.
+Default value: *first column containing the word compoundID in its
+column label or sequentially generated IDs*.
+--ReferenceCompoundIDPrefix *text*
+Specify compound ID prefix to use during sequential generation of
+compound IDs for reference fingerprints *SDFile* and *TextFile*.
+Default value: *Cmpd*. The default value generates compound IDs
+which look like Cmpd<Number>.
+For reference fingerprints *SDFile*, this value is only used during
+*LabelPrefix | MolNameOrLabelPrefix* values of
+--ReferenceCompoundIDMode option; otherwise, it's ignored.
+Examples for *LabelPrefix* or *MolNameOrLabelPrefix* value of
+--ReferenceCompoundIDMode:
+Compound
+The value specified above generates compound IDs which correspond
+to Compound<Number> instead of default value of Cmpd<Number>.
+--ReferenceCompoundIDField *DataFieldName*
+Specify reference fingerprints *SDFile* datafield label for
+generating compound IDs. This value is only used during *DataField*
+value of --ReferenceCompoundIDMode option.
+Examples for *DataField* value of --ReferenceCompoundIDMode:
+MolID
+ExtReg
+--ReferenceCompoundIDMode *DataField | MolName | LabelPrefix |
+MolNameOrLabelPrefix*
+Specify how to generate compound IDs from reference fingerprints
+*SDFile* during similarity and dissimilarity search for output SD
+and CSV/TSV text files: use a *SDFile* datafield value; use molname
+line from *SDFile*; generate a sequential ID with specific prefix;
+use combination of both MolName and LabelPrefix with usage of
+LabelPrefix values for empty molname lines.
+Possible values: *DataField | MolName | LabelPrefix |
+MolNameOrLabelPrefix*. Default: *LabelPrefix*.
+For *MolNameOrLabelPrefix* value of --ReferenceCompoundIDMode,
+molname line in *SDFiles* takes precedence over sequential compound
+IDs generated using *LabelPrefix* and only empty molname values are
+replaced with sequential compound IDs.
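The *MolNameOrLabelPrefix* behavior can be sketched in Python as follows. This is only an illustration: the function name `compound_ids` is hypothetical, and numbering generated IDs by record position in the file is an assumption, since the text above does not say how the sequence number is chosen.

```python
def compound_ids(mol_names, prefix="Cmpd"):
    """Use the molname line when present, else a sequential prefixed ID.

    Assumption: the sequence number is the 1-based record position.
    """
    ids = []
    for position, name in enumerate(mol_names, start=1):
        name = name.strip()
        ids.append(name if name else prefix + str(position))
    return ids


print(compound_ids(["Aspirin", "", "Caffeine"]))
print(compound_ids(["", ""], prefix="Compound"))
```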
+--ReferenceFingerprintsCol *col number | col name*
+This value is --ReferenceColMode specific. It specifies fingerprints
+column to use during similarity and dissimilarity search for
+reference fingerprints *TextFile*. Possible values: *col number or
+col label*. Default value: *first column containing the word
+Fingerprints in its column label*.
+--ReferenceFingerprintsField *FieldLabel*
+Fingerprints field label to use during similarity and dissimilarity
+search for reference fingerprints *SDFile*. Default value: *first
+data field label containing the word Fingerprints in its label*.
+-r, --root *RootName*
+New file name is generated using the root: <Root>.<Ext>. Default for
+new file name: <ReferenceFileName>SimilaritySearching.<Ext>. The
+output file type determines <Ext> value. The sdf, csv, and tsv <Ext>
+values are used for SD, comma/semicolon, and tab delimited text
+files respectively.
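The naming rule above can be sketched in Python. The function name `output_file_names` is hypothetical, and the sketch assumes that <ReferenceFileName> means the reference file name without its directory and extension.

```python
import os


def output_file_names(reference_file, root=None, output="text", out_delim="comma"):
    """Build output file names as <Root>.<Ext> for the requested output types."""
    if root is None:
        # Assumption: the default root drops the directory and the extension.
        base = os.path.splitext(os.path.basename(reference_file))[0]
        root = base + "SimilaritySearching"
    names = []
    if output in ("SD", "both"):
        names.append(root + ".sdf")
    if output in ("text", "both"):
        # Tab-delimited text uses tsv; comma and semicolon both use csv.
        names.append(root + (".tsv" if out_delim == "tab" else ".csv"))
    return names


print(output_file_names("ReferenceSampleFPHex.sdf"))
print(output_file_names("ReferenceSampleFPHex.sdf", root="SimilaritySearchResults",
                        output="both", out_delim="tab"))
```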
+-s, --SearchMode *SimilaritySearch | DissimilaritySearch*
+Specify how to find molecules from database molecules for individual
+reference molecules or set of reference molecules: Find similar
+molecules or dissimilar molecules from database molecules. Possible
+values: *SimilaritySearch | DissimilaritySearch*. Default value:
+*SimilaritySearch*.
+During *DissimilaritySearch* value of -s, --SearchMode option, the
+meaning of the following options is switched and they correspond to
+dissimilar molecules instead of similar molecules:
+--SimilarCountMode, -n, --NumOfSimilarMolecules,
+--PercentSimilarMolecules, -k, --kNN.
+--SimilarCountMode *NumOfSimilar | PercentSimilar*
+Specify method used to count similar molecules found from database
+molecules for individual reference molecules or set of reference
+molecules: Find number of similar molecules or percent of similar
+molecules from database molecules. Possible values: *NumOfSimilar |
+PercentSimilar*. Default value: *NumOfSimilar*.
+The values for number of similar molecules and percent similar
+molecules are specified using options -n, --NumOfSimilarMolecules and
+-p, --PercentSimilarMolecules.
+This option is -s, --SearchMode dependent: It corresponds to
+dissimilar molecules during *DissimilaritySearch* value of -s,
+--SearchMode option.
+--SimilarityCutoff *number*
+Similarity cutoff value to use during comparison of similarity value
+between a pair of database and reference molecules calculated by
+similarity comparison methods for fingerprints bit-vector or vector
+strings data values. Possible values: *Any valid number*. Default
+value: *0.75*.
+The comparison value between a pair of database and reference
+molecules must meet the cutoff criterion as shown below:
+SearchMode     CutoffCriterion  ComparisonValues
+Similarity     >=               Higher value implies high similarity
+Dissimilarity  <=               Lower value implies high dissimilarity
+This option is ignored during *No* value of --GroupFusionApplyCutoff
+for *MultipleReferences* -m, --mode.
+This option is -s, --SearchMode dependent: It corresponds to
+dissimilar molecules during *DissimilaritySearch* value of -s,
+--SearchMode option.
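The cutoff criteria for --SimilarityCutoff and --DistanceCutoff can be combined into one small check. The following Python sketch is illustrative only; the function name `meets_cutoff` and the `Similarity` and `Distance` labels for the coefficient type are assumptions, not names used by the script.

```python
def meets_cutoff(value, cutoff, coefficient_kind, search_mode="SimilaritySearch"):
    """Check a comparison value against a similarity or distance cutoff."""
    if search_mode not in ("SimilaritySearch", "DissimilaritySearch"):
        raise ValueError("unknown search mode: " + search_mode)
    similarity_search = search_mode == "SimilaritySearch"
    if coefficient_kind == "Similarity":
        # Similarity search keeps values >= cutoff; dissimilarity search keeps <=.
        return value >= cutoff if similarity_search else value <= cutoff
    if coefficient_kind == "Distance":
        # Distance values are reversed: similarity search keeps values <= cutoff.
        return value <= cutoff if similarity_search else value >= cutoff
    raise ValueError("unknown coefficient kind: " + coefficient_kind)


print(meets_cutoff(0.80, 0.75, "Similarity"))
print(meets_cutoff(12.0, 10, "Distance"))
```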
+-v, --VectorComparisonMode *SupportedSimilarityName |
+SupportedDistanceName*
+Specify what similarity or distance coefficient to use for
+calculating similarity between fingerprint vector strings data
+values in *ReferenceFingerprintsFile* and *DatabaseFingerprintsFile*
+during similarity search. Possible values: *TanimotoSimilarity | ...
+| ManhattanDistance | ...*. Default value: *TanimotoSimilarity*.
+The value of -v, --VectorComparisonMode, in conjunction with
+--VectorComparisonFormulism, decides which type of similarity and
+distance coefficient formulism gets used.
+The current release supports the following similarity and distance
+coefficients: *CosineSimilarity, CzekanowskiSimilarity,
+DiceSimilarity, OchiaiSimilarity, JaccardSimilarity,
+SorensonSimilarity, TanimotoSimilarity, CityBlockDistance,
+EuclideanDistance, HammingDistance, ManhattanDistance,
+SoergelDistance*. These similarity and distance coefficients are
+described below.
+FingerprintsVector.pm module, used to calculate similarity and
+distance coefficients, provides support to perform comparison
+between vectors containing three different types of values:
+Type I: OrderedNumericalValues
+. Size of two vectors are same
+. Vectors contain real values in a specific order. For example: MACCS keys
+count, Topological pharmacophore atom pairs and so on.
+Type II: UnorderedNumericalValues
+. Size of two vectors might not be same
+. Vectors contain unordered real values identified by value IDs. For example:
+Topological atom pairs, Topological atom torsions and so on
+Type III: AlphaNumericalValues
+. Size of two vectors might not be same
+. Vectors contain unordered alphanumerical values. For example: Extended
+connectivity fingerprints, atom neighborhood fingerprints.
+Before performing similarity or distance calculations between
+vectors containing UnorderedNumericalValues or AlphaNumericalValues,
+the vectors are transformed into vectors containing unique
+OrderedNumericalValues using value IDs for UnorderedNumericalValues
+and the values themselves for AlphaNumericalValues.
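The transformation into ordered vectors can be sketched in Python. This is an illustration of the idea rather than the FingerprintsVector.pm implementation: the helper names are hypothetical, and sorting the IDs to fix the element order is an assumption (any order works as long as both vectors share it).

```python
from collections import Counter


def align_by_ids(a, b):
    """Align two {value ID: value} mappings into same-size ordered vectors."""
    ids = sorted(set(a) | set(b))
    return [a.get(i, 0) for i in ids], [b.get(i, 0) for i in ids]


def align_alphanumeric(a_values, b_values):
    """Count each unique alphanumerical value, then align the counts."""
    return align_by_ids(Counter(a_values), Counter(b_values))


# UnorderedNumericalValues: counts keyed by value IDs (IDs are made up).
print(align_by_ids({"C-C": 2, "C-N": 1}, {"C-C": 1, "C-O": 3}))
# AlphaNumericalValues: the values themselves act as IDs.
print(align_alphanumeric(["x", "y", "x"], ["y", "z"]))
```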
+Three forms of similarity and distance calculation between two
+vectors, specified using --VectorComparisonFormulism option, are
+supported: *AlgebraicForm, BinaryForm or SetTheoreticForm*.
+For *BinaryForm*, the ordered list of processed final vector values
+containing the value or count of each unique value type is simply
+converted into a binary vector containing 1s and 0s corresponding to
+presence or absence of values before calculating similarity or
+distance between two vectors.
+For two fingerprint vectors A and B of same size containing
+OrderedNumericalValues, let:
+N = Number of values in A or B
+Xa = Values of vector A
+Xb = Values of vector B
+Xai = Value of ith element in A
+Xbi = Value of ith element in B
+SUM = Sum of i over N values
+For SetTheoreticForm of calculation between two vectors, let:
+SetIntersectionXaXb = SUM ( MIN ( Xai, Xbi ) )
+SetDifferenceXaXb = SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN ( Xai, Xbi ) )
+For BinaryForm of calculation between two vectors, let:
+Na = Number of bits set to "1" in A = SUM ( Xai )
+Nb = Number of bits set to "1" in B = SUM ( Xbi )
+Nc = Number of bits set to "1" in both A and B = SUM ( Xai * Xbi )
+Nd = Number of bits set to "0" in both A and B
+= SUM ( 1 - Xai - Xbi + Xai * Xbi)
+N = Number of bits set to "1" or "0" in A or B = Size of A or B = Na + Nb - Nc + Nd
+Additionally, for BinaryForm various values also correspond to:
+Na = | Xa |
+Nb = | Xb |
+Nc = | SetIntersectionXaXb |
+Nd = N - | SetDifferenceXaXb |
+| SetDifferenceXaXb | = N - Nd = Na + Nb - Nc + Nd - Nd = Na + Nb - Nc
+=  | Xa | + | Xb | - | SetIntersectionXaXb |
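The BinaryForm quantities defined above can be computed directly from two equal-size vectors. This Python sketch uses a hypothetical function name, `binary_counts`, and treats any non-zero value as a set bit.

```python
def binary_counts(a, b):
    """Return (Na, Nb, Nc, Nd, N) for two equal-size vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    xa = [1 if v else 0 for v in a]
    xb = [1 if v else 0 for v in b]
    na = sum(xa)                                   # bits set in A
    nb = sum(xb)                                   # bits set in B
    nc = sum(x * y for x, y in zip(xa, xb))        # bits set in both
    nd = sum(1 - x - y + x * y for x, y in zip(xa, xb))  # bits set in neither
    return na, nb, nc, nd, len(xa)


na, nb, nc, nd, n = binary_counts([2, 1, 0, 0], [1, 0, 3, 0])
print(na, nb, nc, nd, n)
print(n == na + nb - nc + nd)
```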
+Various similarity and distance coefficients [ Ref 40, Ref 62, Ref
+64 ] for a pair of vectors A and B in *AlgebraicForm, BinaryForm and
+SetTheoreticForm* are defined as follows:
+CityBlockDistance: ( same as HammingDistance and ManhattanDistance)
+*AlgebraicForm*: SUM ( ABS ( Xai - Xbi ) )
+*BinaryForm*: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
+*SetTheoreticForm*: | SetDifferenceXaXb | - | SetIntersectionXaXb |
+= SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
+CosineSimilarity: ( same as OchiaiSimilarityCoefficient)
+*AlgebraicForm*: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2) * SUM (
+Xbi ** 2) )
+*BinaryForm*: Nc / SQRT ( Na * Nb)
+*SetTheoreticForm*: | SetIntersectionXaXb | / SQRT ( |Xa| * |Xb| ) =
+SUM ( MIN ( Xai, Xbi ) ) / SQRT ( SUM ( Xai ) * SUM ( Xbi ) )
+CzekanowskiSimilarity: ( same as DiceSimilarity and
+SorensonSimilarity)
+*AlgebraicForm*: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) +
+SUM ( Xbi **2 ) )
+*BinaryForm*: 2 * Nc / ( Na + Nb )
+*SetTheoreticForm*: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) =
+2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
+DiceSimilarity: ( same as CzekanowskiSimilarity and
+SorensonSimilarity)
+*AlgebraicForm*: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) +
+SUM ( Xbi **2 ) )
+*BinaryForm*: 2 * Nc / ( Na + Nb )
+*SetTheoreticForm*: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) =
+2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
+EuclideanDistance:
+*AlgebraicForm*: SQRT ( SUM ( ( ( Xai - Xbi ) ** 2 ) ) )
+*BinaryForm*: SQRT ( ( Na - Nc ) + ( Nb - Nc ) ) = SQRT ( Na + Nb -
+2 * Nc )
+*SetTheoreticForm*: SQRT ( | SetDifferenceXaXb | - |
+SetIntersectionXaXb | ) = SQRT ( SUM ( Xai ) + SUM ( Xbi ) - 2 * (
+SUM ( MIN ( Xai, Xbi ) ) ) )
+HammingDistance: ( same as CityBlockDistance and ManhattanDistance)
+*AlgebraicForm*: SUM ( ABS ( Xai - Xbi ) )
+*BinaryForm*: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
+*SetTheoreticForm*: | SetDifferenceXaXb | - | SetIntersectionXaXb |
+= SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
+JaccardSimilarity: ( same as TanimotoSimilarity)
+*AlgebraicForm*: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi
+** 2 ) - SUM ( Xai * Xbi ) )
+*BinaryForm*: Nc / ( ( Na - Nc ) + ( Nb - Nc ) + Nc ) = Nc / ( Na +
+Nb - Nc )
+*SetTheoreticForm*: | SetIntersectionXaXb | / | SetDifferenceXaXb |
+= SUM ( MIN ( Xai, Xbi ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN
+( Xai, Xbi ) ) )
+ManhattanDistance: ( same as CityBlockDistance and HammingDistance)
+*AlgebraicForm*: SUM ( ABS ( Xai - Xbi ) )
+*BinaryForm*: ( Na - Nc ) + ( Nb - Nc ) = Na + Nb - 2 * Nc
+*SetTheoreticForm*: | SetDifferenceXaXb | - | SetIntersectionXaXb |
+= SUM ( Xai ) + SUM ( Xbi ) - 2 * ( SUM ( MIN ( Xai, Xbi ) ) )
+OchiaiSimilarity: ( same as CosineSimilarity)
+*AlgebraicForm*: SUM ( Xai * Xbi ) / SQRT ( SUM ( Xai ** 2) * SUM (
+Xbi ** 2) )
+*BinaryForm*: Nc / SQRT ( Na * Nb)
+*SetTheoreticForm*: | SetIntersectionXaXb | / SQRT ( |Xa| * |Xb| ) =
+SUM ( MIN ( Xai, Xbi ) ) / SQRT ( SUM ( Xai ) * SUM ( Xbi ) )
+SorensonSimilarity: ( same as CzekanowskiSimilarity and
+DiceSimilarity)
+*AlgebraicForm*: ( 2 * ( SUM ( Xai * Xbi ) ) ) / ( SUM ( Xai ** 2) +
+SUM ( Xbi **2 ) )
+*BinaryForm*: 2 * Nc / ( Na + Nb )
+*SetTheoreticForm*: 2 * | SetIntersectionXaXb | / ( |Xa| + |Xb| ) =
+2 * ( SUM ( MIN ( Xai, Xbi ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) )
+SoergelDistance:
+*AlgebraicForm*: SUM ( ABS ( Xai - Xbi ) ) / SUM ( MAX ( Xai, Xbi )
+)
+*BinaryForm*: 1 - Nc / ( Na + Nb - Nc ) = ( Na + Nb - 2 * Nc ) / (
+Na + Nb - Nc )
+*SetTheoreticForm*: ( | SetDifferenceXaXb | - | SetIntersectionXaXb
+| ) / | SetDifferenceXaXb | = ( SUM ( Xai ) + SUM ( Xbi ) - 2 * (
+SUM ( MIN ( Xai, Xbi ) ) ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM (
+MIN ( Xai, Xbi ) ) )
+TanimotoSimilarity: ( same as JaccardSimilarity)
+*AlgebraicForm*: SUM ( Xai * Xbi ) / ( SUM ( Xai ** 2 ) + SUM ( Xbi
+** 2 ) - SUM ( Xai * Xbi ) )
+*BinaryForm*: Nc / ( ( Na - Nc ) + ( Nb - Nc ) + Nc ) = Nc / ( Na +
+Nb - Nc )
+*SetTheoreticForm*: | SetIntersectionXaXb | / | SetDifferenceXaXb |
+= SUM ( MIN ( Xai, Xbi ) ) / ( SUM ( Xai ) + SUM ( Xbi ) - SUM ( MIN
+( Xai, Xbi ) ) )
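As a worked illustration of the distance formulas above, the following Python sketch evaluates CityBlockDistance, EuclideanDistance and SoergelDistance in AlgebraicForm, plus CityBlockDistance in SetTheoreticForm, for two small ordered vectors. The function names are hypothetical, and the sketch does not guard against a zero denominator in SoergelDistance.

```python
import math


def city_block_algebraic(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


def city_block_set_theoretic(a, b):
    return sum(a) + sum(b) - 2 * sum(min(x, y) for x, y in zip(a, b))


def euclidean_algebraic(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def soergel_algebraic(a, b):
    # Not guarded against all-zero vectors (zero denominator).
    return city_block_algebraic(a, b) / sum(max(x, y) for x, y in zip(a, b))


a = [2, 1, 0, 0]
b = [1, 0, 3, 0]
print(city_block_algebraic(a, b), city_block_set_theoretic(a, b))
print(euclidean_algebraic(a, b))
print(soergel_algebraic(a, b))
```

For CityBlockDistance the two forms give the same number, since ABS ( Xai - Xbi ) equals Xai + Xbi - 2 * MIN ( Xai, Xbi ) element by element.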
+--VectorComparisonFormulism *AlgebraicForm | BinaryForm |
+SetTheoreticForm*
+Specify fingerprints vector comparison formulism to use for
+calculating similarity and distance coefficients during -v,
+--VectorComparisonMode. Possible values: *AlgebraicForm | BinaryForm
+| SetTheoreticForm*. Default value: *AlgebraicForm*.
+For fingerprint vector strings containing AlphaNumericalValues data
+values - ExtendedConnectivityFingerprints,
+AtomNeighborhoodsFingerprints and so on - all three formulisms result
+in the same value during similarity and distance calculations.
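The effect of the chosen formulism can be seen by evaluating TanimotoSimilarity in all three forms. This Python sketch follows the formulas given under -v, --VectorComparisonMode; the function name `tanimoto` is hypothetical and zero denominators are not handled. The second example uses vectors holding only 0s and 1s, which is an assumed stand-in for how AlphaNumericalValues behave once unique values are reduced to presence or absence.

```python
def tanimoto(a, b, form="AlgebraicForm"):
    """TanimotoSimilarity of two equal-size vectors in the requested form."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same size")
    if form == "AlgebraicForm":
        ab = sum(x * y for x, y in zip(a, b))
        return ab / (sum(x * x for x in a) + sum(y * y for y in b) - ab)
    if form == "BinaryForm":
        xa = [1 if x else 0 for x in a]
        xb = [1 if y else 0 for y in b]
        nc = sum(x * y for x, y in zip(xa, xb))
        return nc / (sum(xa) + sum(xb) - nc)
    if form == "SetTheoreticForm":
        intersection = sum(min(x, y) for x, y in zip(a, b))
        return intersection / (sum(a) + sum(b) - intersection)
    raise ValueError("unknown formulism: " + form)


forms = ("AlgebraicForm", "BinaryForm", "SetTheoreticForm")
# Count vectors: the three forms generally differ.
print([tanimoto([2, 1, 0, 0], [1, 0, 3, 0], f) for f in forms])
# 0/1 vectors: the three forms agree.
print([tanimoto([1, 1, 0, 0], [1, 0, 1, 0], f) for f in forms])
```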
+-w, --WorkingDir *DirName*
+Location of working directory. Default: current directory.
+EXAMPLES
+To perform similarity search using Tanimoto coefficient by treating all
+reference molecules as a set to find 10 most similar database molecules
+with application of Max group fusion rule and similarity cutoff to
+supported fingerprints strings data in SD fingerprints files present in
+data fields with Fingerprint substring in their labels, and create a
+ReferenceFPHexSimilaritySearching.csv file containing sequentially
+generated database compound IDs with Cmpd prefix, type:
+% SimilaritySearchingFingerprints.pl -o ReferenceSampleFPHex.sdf
+DatabaseSampleFPHex.sdf
+To perform similarity search using Tanimoto coefficient by treating all
+reference molecules as a set to find 10 most similar database molecules
+with application of Max group fusion rule and similarity cutoff to
+supported fingerprints strings data in FP fingerprints files, and create
+a SimilaritySearchResults.csv file containing database compound IDs
+retrieved from FP file, type:
+% SimilaritySearchingFingerprints.pl -r SimilaritySearchResults -o
+ReferenceSampleFPBin.fpf DatabaseSampleFPBin.fpf
+To perform similarity search using Tanimoto coefficient by treating all
+reference molecules as a set to find 10 most similar database
+molecules with application of Max group fusion rule and similarity
+cutoff to supported fingerprints strings data in text fingerprints files
+present in columns containing Fingerprint substring in their
+names, and create a ReferenceFPHexSimilaritySearching.csv file
+containing database compound IDs retrieved from the column containing
+CompoundID substring or sequentially generated compound IDs, type:
+% SimilaritySearchingFingerprints.pl -o ReferenceSampleFPCount.csv
+DatabaseSampleFPCount.csv
+To perform similarity search using Tanimoto coefficient by treating
+reference molecules as individual molecules to find 10 most similar
+database molecules for each reference molecule with application of
+similarity cutoff to supported fingerprints strings data in SD
+fingerprints files present in data fields with Fingerprint substring
+in their labels, and create a ReferenceFPHexSimilaritySearching.csv file
+containing sequentially generated reference and database compound IDs
+with Cmpd prefix, type:
+% SimilaritySearchingFingerprints.pl --mode IndividualReference -o
+ReferenceSampleFPHex.sdf DatabaseSampleFPHex.sdf
+To perform similarity search using Tanimoto coefficient by treating
+reference molecules as individual molecules to find 10 most similar
+database molecules for each reference molecule with application of
+similarity cutoff to supported fingerprints strings data in FP
+fingerprints files, and create a ReferenceFPHexSimilaritySearching.csv
+file containing reference and database compound IDs retrieved from FP
+file, type:
+% SimilaritySearchingFingerprints.pl --mode IndividualReference -o
+ReferenceSampleFPHex.fpf DatabaseSampleFPHex.fpf
+To perform similarity search using Tanimoto coefficient by treating
+reference molecules as individual molecules to find 10 most similar
+database molecules for each reference molecule with application of
+similarity cutoff to supported fingerprints strings data in text
+fingerprints files present in columns containing Fingerprint
+substring in their names, and create a
+ReferenceFPHexSimilaritySearching.csv file containing reference and
+database compound IDs retrieved from the column containing CompoundID
+substring or sequentially generated compound IDs, type:
+% SimilaritySearchingFingerprints.pl --mode IndividualReference -o
+ReferenceSampleFPHex.csv DatabaseSampleFPHex.csv
+To perform dissimilarity search using Tanimoto coefficient by treating
+all reference molecules as a set to find 10 most dissimilar database
+molecules with application of Max group fusion rule and similarity
+cutoff to supported fingerprints strings data in SD fingerprints files
+present in data fields with Fingerprint substring in their labels, and
+create a ReferenceFPHexSimilaritySearching.csv file containing
+sequentially generated database compound IDs with Cmpd prefix, type:
+% SimilaritySearchingFingerprints.pl --mode MultipleReferences --SearchMode
+DissimilaritySearch -o ReferenceSampleFPHex.sdf DatabaseSampleFPHex.sdf
+To perform similarity search using CityBlock distance by treating
+reference molecules as individual molecules to find 10 most similar
+database molecules for each reference molecule with application of
+distance cutoff to supported vector fingerprints strings data in SD
+fingerprints files present in data fields with Fingerprint substring
+in their labels, and create a ReferenceFPHexSimilaritySearching.csv file
+containing sequentially generated reference and database compound IDs
+with Cmpd prefix, type:
+% SimilaritySearchingFingerprints.pl --mode IndividualReference
+--VectorComparisonMode CityBlockDistance --VectorComparisonFormulism
+AlgebraicForm --DistanceCutoff 10 -o
+ReferenceSampleFPCount.sdf DatabaseSampleFPCount.sdf
+To perform similarity search using Tanimoto coefficient by treating all
+reference molecules as a set to find 100 most similar database molecules
+with application of Mean group fusion rule to top 10 reference
+molecules within similarity cutoff of 0.75 to supported fingerprints
+strings data in FP fingerprints files, and create a
+ReferenceFPHexSimilaritySearching.csv file containing database compound
+IDs retrieved from FP file, type:
+% SimilaritySearchingFingerprints.pl --mode MultipleReferences --SearchMode
+SimilaritySearch --BitVectorComparisonMode TanimotoSimilarity
+--GroupFusionRule Mean --GroupFusionApplyCutoff Yes --kNN 10
+--SimilarityCutoff 0.75 --SimilarCountMode NumOfSimilar
+--NumOfSimilarMolecules 100 -o
+ReferenceSampleFPHex.fpf DatabaseSampleFPHex.fpf
+To perform similarity search using Tanimoto coefficient by treating
+reference molecules as individual molecules to find 2 percent of most
+similar database molecules for each reference molecule with application
+of similarity cutoff of 0.85 to supported fingerprints strings data in
+text fingerprints files present in specific columns and create a
+ReferenceFPHexSimilaritySearching.csv file containing reference and
+database compoundIDs retrieved from specific columns, type:
+% SimilaritySearchingFingerprints.pl --mode IndividualReference --SearchMode
+SimilaritySearch --BitVectorComparisonMode TanimotoSimilarity
+--ReferenceColMode ColLabel --ReferenceFingerprintsCol Fingerprints
+--ReferenceCompoundIDCol CompoundID --DatabaseColMode ColLabel
+--DatabaseCompoundIDCol CompoundID --DatabaseFingerprintsCol
+Fingerprints --SimilarityCutoff 0.85 --SimilarCountMode PercentSimilar
+--PercentSimilarMolecules 2 -o
+ReferenceSampleFPHex.csv DatabaseSampleFPHex.csv
+To perform similarity search using Tanimoto coefficient by treating
+reference molecules as individual molecules to find top 50 most similar
+database molecules for each reference molecule with application of
+similarity cutoff of 0.85 to supported fingerprints strings data in SD
+fingerprints files present in specific data fields and create both
+ReferenceFPHexSimilaritySearching.csv and
+ReferenceFPHexSimilaritySearching.sdf files containing reference and
+database compoundIDs retrieved from specific data fields, type:
+% SimilaritySearchingFingerprints.pl --mode IndividualReference --SearchMode
+SimilaritySearch --BitVectorComparisonMode TanimotoSimilarity
+--ReferenceFingerprintsField Fingerprints
+--DatabaseFingerprintsField Fingerprints
+--ReferenceCompoundIDMode DataField --ReferenceCompoundIDField CmpdID
+--DatabaseCompoundIDMode DataField --DatabaseCompoundIDField CmpdID
+--SimilarityCutoff 0.85 --SimilarCountMode NumOfSimilar
+--NumOfSimilarMolecules 50 --output both -o
+ReferenceSampleFPHex.sdf DatabaseSampleFPHex.sdf
+To perform similarity search using Tanimoto coefficient by treating
+reference molecules as individual molecules to find 1 percent of most
+similar database molecules for each reference molecule with application
+of similarity cutoff to supported fingerprints strings data in SD
+fingerprints files present in specific data field labels, and create
+both ReferenceFPHexSimilaritySearching.csv and
+ReferenceFPHexSimilaritySearching.sdf files containing reference and
+database compound IDs retrieved from specific data field labels along
+with other specific data for database molecules, type:
+% SimilaritySearchingFingerprints.pl --mode IndividualReference --SearchMode
+SimilaritySearch --BitVectorComparisonMode TanimotoSimilarity
+--ReferenceFingerprintsField Fingerprints
+--DatabaseFingerprintsField Fingerprints
+--ReferenceCompoundIDMode DataField --ReferenceCompoundIDField CmpdID
+--DatabaseCompoundIDMode DataField --DatabaseCompoundIDField CmpdID
+--DatabaseDataFieldsMode Specify --DatabaseDataFields "TPSA,SLogP"
+--SimilarityCutoff 0.75 --SimilarCountMode PercentSimilar
+--PercentSimilarMolecules 1 --output both --OutDelim comma --quote Yes
+--precision 3 -o ReferenceSampleFPHex.sdf DatabaseSampleFPHex.sdf
+AUTHOR
+Manish Sud <msud@san.rr.com>
+SEE ALSO
+InfoFingerprintsFiles.pl, SimilarityMatricesFingerprints.pl,
+AtomNeighborhoodsFingerprints.pl, ExtendedConnectivityFingerprints.pl,
+MACCSKeysFingerprints.pl, PathLengthFingerprints.pl,
+TopologicalAtomPairsFingerprints.pl,
+TopologicalAtomTorsionsFingerprints.pl,
+TopologicalPharmacophoreAtomPairsFingerprints.pl,
+TopologicalPharmacophoreAtomTripletsFingerprints.pl
+COPYRIGHT
+Copyright (C) 2015 Manish Sud. All rights reserved.
+This file is part of MayaChemTools.
+MayaChemTools is free software; you can redistribute it and/or modify it
+under the terms of the GNU Lesser General Public License as published by
+the Free Software Foundation; either version 3 of the License, or (at
+your option) any later version.
