371 Similarity


Post on 02-Jan-2016






  • Similarity Methods (C371, Fall 2004)

  • Limitations of Substructure Searching/3D Pharmacophore Searching: You need to know what you are looking for. A compound is either there or not. You don't get a feel for the relative ranking of the compounds. Output size can be a problem.

  • Similarity Searching: Look for the compounds that are most similar to the query compound. Each compound in the database is ranked. In other application areas, the technique is known as pattern matching or signature analysis.

  • Similar Property Principle: Structurally similar molecules usually have similar properties, e.g., biological activity. Also known as neighborhood behavior. Examples: morphine, codeine, heroin. Definition of in silico: using computational techniques as a substitute for or complement to experimental methods.

  • Advantages of Similarity Searching: One known active compound becomes the search key. The user sets the limits on output. It is possible to recycle the top answers to find other possibilities. Subjective determination of the degree of similarity.

  • Applications of Similarity Searching: Evaluation of the uniqueness of proposed or newly synthesized compounds. Finding starting materials or intermediates in synthesis design. Handling of chemical reactions and mixtures. Finding the right chemicals for one's needs, even if not sure what is needed.

  • Subjective Nature of Similarity Searching: No hard and fast rules. Numerical descriptors are used to compare molecules. A similarity coefficient is defined to quantify the degree of similarity. Similarity and dissimilarity rankings can be different in principle.

  • Similarity and Dissimilarity: "Consider two objects A and B, a is the number of features (characteristics) present in A and absent in B, b is the number of features absent in A and present in B, c is the number of features common to both objects, and d is the number of features absent from both objects. Thus, c and d measure the present and the absent matches, respectively, i.e., similarity; while a and b measure the corresponding mismatches, i.e., dissimilarity." (Chemoinformatics: A Textbook (2003), p. 304)
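
The four counts in the quoted definition can be sketched with plain Python sets. This is a minimal illustration, not code from the lecture; the function name `feature_counts` and the feature labels are hypothetical.

```python
def feature_counts(features_a, features_b, all_features):
    """Return (a, b, c, d) as defined in the quoted passage."""
    set_a, set_b, universe = set(features_a), set(features_b), set(all_features)
    a = len(set_a - set_b)               # present in A, absent in B
    b = len(set_b - set_a)               # absent in A, present in B
    c = len(set_a & set_b)               # common to both objects
    d = len(universe - (set_a | set_b))  # absent from both objects
    return a, b, c, d

# Hypothetical feature labels, chosen only for illustration.
counts = feature_counts(
    {"OH", "ring", "N"},
    {"ring", "N", "Cl"},
    {"OH", "ring", "N", "Cl", "Br", "S"},
)
print(counts)  # (1, 1, 2, 2)
```

Here c and d count the matches (shared presence and shared absence), while a and b count the mismatches.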

  • 2D Similarity Measures: Commonly based on fingerprints, binary vectors with 1 indicating the presence of the fragment and 0 the absence. Could relate structural keys, hashed fingerprints, or continuous data (e.g., topological indexes that take into account size, degree of branching, and overall shape).

  • Tanimoto Coefficient: The Tanimoto coefficient of similarity for molecules A and B is $S_{AB} = \frac{c}{a + b - c}$, where a = number of bits set to 1 in A, b = number of bits set to 1 in B, and c = number of 1 bits common to both. Range is 0 to 1. A value of 1 does not mean the molecules are identical, only that their fingerprints are.
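
A minimal sketch of the Tanimoto calculation, assuming each fingerprint is stored as a Python set holding the positions of its 1 bits. The function name and the example bit positions are hypothetical, and returning 0 for two empty fingerprints is a convention chosen here, not something stated in the slides.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient for two fingerprints given as sets of on-bit positions."""
    a = len(bits_a)           # bits set to 1 in A
    b = len(bits_b)           # bits set to 1 in B
    c = len(bits_a & bits_b)  # 1 bits common to both
    denominator = a + b - c
    # Assumption: two all-zero fingerprints are scored 0 to avoid dividing by zero.
    return c / denominator if denominator else 0.0

query = {1, 4, 7, 9}          # hypothetical on-bit positions
candidate = {1, 4, 8, 9, 12}
print(tanimoto(query, candidate))  # 3 / (4 + 5 - 3) = 0.5
```

Because the calculation only sees the bits, two different structures that happen to set the same bits score 1.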

  • Similarity Coefficients: The Tanimoto coefficient is the most widely used for binary fingerprints. Others: Dice coefficient, cosine similarity, Euclidean distance, Hamming distance, Soergel distance.
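
The other coefficients named above can be sketched in their usual binary-fingerprint forms, with a and b the numbers of bits set in A and B and c the number of bits in common (the same convention as the Tanimoto slide). The function names are hypothetical, and the sketch assumes neither fingerprint is empty.

```python
import math

def bit_counts(bits_a, bits_b):
    """Return (a, b, c): bits set in A, bits set in B, bits set in both."""
    return len(bits_a), len(bits_b), len(bits_a & bits_b)

def dice(bits_a, bits_b):
    a, b, c = bit_counts(bits_a, bits_b)
    return 2 * c / (a + b)

def cosine(bits_a, bits_b):
    a, b, c = bit_counts(bits_a, bits_b)
    return c / math.sqrt(a * b)

def hamming(bits_a, bits_b):
    a, b, c = bit_counts(bits_a, bits_b)
    return a + b - 2 * c              # number of positions where the bits differ

def euclidean(bits_a, bits_b):
    return math.sqrt(hamming(bits_a, bits_b))

def soergel(bits_a, bits_b):
    a, b, c = bit_counts(bits_a, bits_b)
    return (a + b - 2 * c) / (a + b - c)  # equals 1 - Tanimoto for binary data

fp_a, fp_b = {1, 2, 3}, {2, 3, 4}     # hypothetical on-bit positions
print(dice(fp_a, fp_b), cosine(fp_a, fp_b), hamming(fp_a, fp_b), soergel(fp_a, fp_b))
```

Dice and cosine are similarities (larger means more alike); Hamming, Euclidean, and Soergel are distances (larger means less alike).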

  • Distance Between Pairs of Molecules: Used to define the dissimilarity of molecules. Regards a common absence of a feature as evidence of similarity.

  • When is a distance coefficient a metric? Distance values must be zero or positive. The distance from an object to itself must be zero. Distance values must be symmetric. Distance values must obey the triangle inequality: $D_{AB} \le D_{AC} + D_{BC}$. The distance between non-identical objects must be greater than zero. Dissimilarity = distance in the n-dimensional descriptor space.
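
The metric conditions listed above can be tested numerically on a sample of fingerprints. The sketch below is illustrative only: the function names and the three toy fingerprints are hypothetical, and passing on a finite sample can refute the metric property but cannot prove it. It compares the Soergel distance with a Dice-based distance (1 minus the Dice coefficient).

```python
from itertools import product

def soergel(x, y):
    return 1.0 - len(x & y) / len(x | y)

def dice_distance(x, y):
    return 1.0 - 2 * len(x & y) / (len(x) + len(y))

def satisfies_metric_rules(points, dist, tol=1e-12):
    """Check the metric conditions on every pair and triple of the sample."""
    for x, y in product(points, repeat=2):
        d = dist(x, y)
        if d < -tol:                        # must be zero or positive
            return False
        if x == y and abs(d) > tol:         # self-distance must be zero
            return False
        if x != y and d <= tol:             # non-identical objects: greater than zero
            return False
        if abs(d - dist(y, x)) > tol:       # must be symmetric
            return False
    for x, y, z in product(points, repeat=3):
        if dist(x, y) > dist(x, z) + dist(y, z) + tol:  # triangle inequality
            return False
    return True

sample = [frozenset({1}), frozenset({2}), frozenset({1, 2})]
print(satisfies_metric_rules(sample, soergel))        # True
print(satisfies_metric_rules(sample, dice_distance))  # False
```

On this sample the Dice-based distance breaks the triangle inequality: the distance between {1} and {2} is 1, but going through {1, 2} costs only 1/3 + 1/3.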

  • Size Dependency of the Measures: Small molecules often have lower similarity values using Tanimoto. Tanimoto normalizes for molecular size in the denominator: $S_{AB} = \frac{c}{a + b - c}$
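
A short worked example of the size effect, using made-up bit counts: in both pairs each fingerprint has exactly one bit that the other lacks, yet the pair with fewer bits set scores much lower.

```python
def tanimoto_from_counts(a, b, c):
    """Tanimoto coefficient from a, b (bits set in A, B) and c (bits in common)."""
    return c / (a + b - c)

# Hypothetical counts: each fingerprint has one bit the other does not share.
small_pair = tanimoto_from_counts(4, 4, 3)     # 3 / 5
large_pair = tanimoto_from_counts(40, 40, 39)  # 39 / 41
print(round(small_pair, 3), round(large_pair, 3))  # 0.6 0.951
```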

  • Other 2D Descriptor Methods: Similarity can be based on continuous whole-molecule properties, e.g., logP, molar refractivity, topological indexes. The usual approach is to use a distance coefficient, such as Euclidean distance.
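
A minimal sketch of ranking by Euclidean distance in a continuous descriptor space. The molecule names and descriptor values are placeholders invented for illustration, not real logP or molar refractivity data. In practice descriptors on different scales are usually standardized first; that step is omitted here.

```python
import math

def euclidean_distance(x, y):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# Placeholder descriptor vectors (not real measurements).
query = (1.0, 2.0, 3.0)
database = {
    "mol_1": (1.0, 2.0, 4.0),
    "mol_2": (4.0, 6.0, 3.0),
    "mol_3": (1.5, 2.0, 3.0),
}

# Smallest distance first: the nearest neighbours are the most similar.
ranked = sorted(database, key=lambda name: euclidean_distance(query, database[name]))
print(ranked)  # ['mol_3', 'mol_1', 'mol_2']
```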

  • Maximum Common Subgraph Similarity: Another approach is to generate an alignment (mapping) between the molecules. Definition of MCS: the largest set of atoms and bonds in common between the two structures. An NP-complete problem (NP = nondeterministic polynomial time): very computer intensive; in the worst case, the algorithm has exponential computational complexity. Tricks are used to cut down on the computer usage.

  • Maximum Common Subgraph

  • Reduced Graph Similarity: A structure's key features are condensed while retaining the connections between them. Can identify structures with similar binding characteristics but different underlying skeletons. The smaller number of nodes speeds up searching.

  • 3D Similarity: The aim is often to identify structurally different molecules. 3D methods require consideration of the conformational properties of molecules.

  • Tanimoto Coefficient to Find Compounds Similar to Morphine

  • 3D: Alignment-Independent Methods: Descriptors: geometric atom pairs and their distances, valence and torsion angles, atom triplets. Consideration of conformational flexibility greatly increases the compute time. Relatively fewer pharmacophoric fingerprints than 2D fingerprints. Result: low similarity values using Tanimoto.

  • Pharmacophore: A structural abstraction of the interactions between various functional group types in a compound. Described by a spatial representation of these groups as centers (or vertices) of geometrical polyhedra, together with pairwise distances between centers. Source: http://www.ma.psu.edu/~csb15/pubs/searle.pdf

  • 3D: Alignment Methods: Require consideration of the degrees of freedom related to the conformational flexibility of the molecules. Goal: determine the alignment where the similarity measure is at a maximum.

  • 3D: Field-Based Alignment Methods: Consideration of the electron density of the molecules. Requires quantum mechanical calculation: costly. Property not sufficiently discriminatory.

  • 3D: Gnomonic Projection Methods: The molecule is positioned at the center of a sphere and its properties are projected onto the surface. The sphere is approximated by a tessellated icosahedron or dodecahedron. Each triangular face is divided into a series of smaller triangles.

  • Finding the Optimal Alignment: Need a mechanism for exploring the orientational (and conformational) degrees of freedom to determine the optimal alignment, where the similarity is maximized. Methods: simplex algorithm, Monte Carlo methods, genetic algorithms.

  • Evaluation of Similarity Methods: Generally, 2D methods are more effective than 3D. 2D methods may be artificially enhanced because of database characteristics (close analogs). Incomplete handling of conformational flexibility in 3D databases. Best to use data fusion techniques, combining methods.
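
One simple way to combine methods is rank-sum fusion: rank the database separately under each method and add up the ranks. The slides do not say which fusion rule was meant, so this rule is an assumption, as are the function names and the made-up scores.

```python
def ranks(scores):
    """Map each name to its rank (1 = most similar) from similarity scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position for position, name in enumerate(ordered, start=1)}

def rank_sum_fusion(*score_tables):
    """Order names by the sum of their ranks across all score tables."""
    rank_tables = [ranks(table) for table in score_tables]
    totals = {name: sum(rt[name] for rt in rank_tables) for name in rank_tables[0]}
    return sorted(totals, key=totals.get)  # lowest rank sum first

# Made-up similarity scores from a 2D method and a 3D method.
scores_2d = {"mol_1": 0.9, "mol_2": 0.5, "mol_3": 0.7, "mol_4": 0.2}
scores_3d = {"mol_1": 0.6, "mol_2": 0.3, "mol_3": 0.8, "mol_4": 0.7}
fused = rank_sum_fusion(scores_2d, scores_3d)
print(fused)  # ['mol_3', 'mol_1', 'mol_4', 'mol_2']
```

Working on ranks rather than raw scores avoids having to put coefficients from different methods on a common scale.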

  • For additional information: See Dr. John Barnard's lecture at http://www.indiana.edu/~cheminfo/C571/c571_Barnard6.ppt