DataWarrior User Manual

Molecule Similarity and Descriptors

Similarity values between molecules play an important role in DataWarrior. They are used to filter compounds or to customize views, e.g. to color and position markers based on compound similarities. Some alpha-numerical data analysis features also work based on compound similarities, e.g. self-organizing maps. In addition, many dedicated cheminformatics analysis methods require a similarity criterion for molecules, e.g. activity cliff analysis, compound clustering, and more. DataWarrior supports various kinds of molecule similarities. These range from a simple chemical similarity based on shared substructure fragments up to a biological similarity that considers 3D-geometry and binding behavior.

When DataWarrior calculates a similarity value between two molecules, the calculation is not performed on the molecular graphs directly. Instead, it involves a two-step process:

  • The molecular graph of every molecule is processed to extract certain molecule features. These are compiled into an abstract molecule description, called a descriptor. In the simplest case the descriptor consists of a binary array in which every bit indicates whether a certain feature is present in the molecule. Binary descriptors are also referred to as fingerprints.
  • In a second step the descriptors of two molecules are compared to reveal the similarity value, i.e. how much the two compounds have in common. In the case of a binary descriptor, one usually divides the number of features that both molecules share by the number of features present in at least one of the two molecules. This is referred to as Tanimoto similarity.

Naturally, the kind of features being collected has a substantial influence on the kind of similarity calculated. DataWarrior supports three different binary descriptors and three more advanced similarity methods, which are all explained in more detail below.
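The Tanimoto calculation on binary descriptors can be illustrated with a minimal Python sketch. It is not DataWarrior's implementation: the function name, the storage of a fingerprint as a plain integer, and the two 8-bit example fingerprints are assumptions made for illustration, as is the convention of returning 1.0 when neither fingerprint has any bit set.

```python
def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity of two binary fingerprints stored as Python ints."""
    common = bin(fp_a & fp_b).count("1")  # features present in both molecules
    either = bin(fp_a | fp_b).count("1")  # features present in at least one
    # Convention chosen for this sketch: two empty fingerprints count as identical.
    return common / either if either else 1.0


# Two hypothetical 8-bit fingerprints (real descriptors are much longer).
mol_a = 0b10110100
mol_b = 0b10010110

# 3 shared features, 5 features overall
similarity = tanimoto(mol_a, mol_b)
```

Here the two molecules share 3 of 5 distinct features, so the similarity is 3/5 = 0.6.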

    Which Descriptor Should Be Used For Which Purpose

    If the purpose is to filter a large compound collection by chemical structure similarity, the default descriptor FragFp is a good choice, because it is automatically available, it does not require much space, and similarity calculations are practically instantaneous.

    If more fine-grained similarities need to be perceived, e.g. if stereoisomers need to be distinguished, or if the best possible results are needed from clustering or any other kind of similarity analysis, then the SkelSpheres descriptor should be used. Especially when creating an evolutionary library in a vast virtual compound space, the SkelSpheres descriptor outperforms the binary fingerprints in quality, because it considers multiple fragment occurrences and makes hash collisions unlikely.

    When chemical functionality from a synthetic chemist's point of view is more important than the carbon skeleton that carries this functionality, then you should try the OrgFunctions descriptor. Examples are searching a chemicals database for an alternative reactant for a reaction, or arranging a building block collection in space based on synthetically accessible functionality.

    If the similarity of biological binding behavior is key rather than merely the similarity of the chemical graph, then use the Flexophore descriptor. Note that it requires more space and significantly more time, both for calculating descriptors and for calculating similarity values.

    The FragFp Descriptor

    DataWarrior's default descriptor FragFp is a binary fingerprint based on a dictionary of substructure fragments, similar to the MDL keys. It relies on a dictionary of 512 predefined structure fragments. These were selected from the multitude of possible structure fragments by optimizing two criteria: All chosen fragments should occur frequently within typical organic molecule structures. Any two chosen fragments should show little overlap concerning their occurrence in diverse sets of organic compounds. The FragFp descriptor contains 1 bit for every fragment in the dictionary. A bit is set to 1 if the corresponding fragment is present in the molecule at least once. In about half of the fragments all hetero atoms have been replaced by wild cards. This way single atom replacements only cause a moderate drop of similarity, which reflects a chemist's natural similarity perception.

    In addition to calculating molecule similarities, DataWarrior uses the FragFp descriptor for a second purpose: the acceleration of sub-structure filtering. Since a sub-structure search is effectively a graph matching algorithm and therefore computationally rather demanding, DataWarrior employs a pre-screening step that can quickly exclude most compounds from the graph-matching. In this step DataWarrior determines a list of all dictionary fragments that are part of the sub-structure query. Molecules that don't contain all of the query's fragments cannot contain the query itself. Therefore, these are skipped in the graph-matching phase.
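The pre-screening idea can be sketched in a few lines of Python. This is only an illustration of the bit logic, not DataWarrior's code: the function names, the 4-bit fingerprints and the molecule names are hypothetical.

```python
def may_contain(query_fp: int, molecule_fp: int) -> bool:
    """True if every fragment bit of the query is also set in the molecule."""
    return (query_fp & ~molecule_fp) == 0


def prescreen(query_fp: int, library: dict) -> list:
    """Return the names of molecules that still need the full graph match."""
    return [name for name, fp in library.items() if may_contain(query_fp, fp)]


# Hypothetical 4-bit fingerprints; a real FragFp has 512 bits.
library = {"mol_1": 0b1110, "mol_2": 0b0110, "mol_3": 0b1011}
query = 0b1010

# mol_2 lacks one of the query's fragments and is skipped.
candidates = prescreen(query, library)
```

A molecule that passes this test does not necessarily contain the query. It only remains a candidate for the expensive graph-matching phase.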

    The PathFp Descriptor

    The PathFp descriptor encodes any linear strand of up to 7 atoms into a hashed binary fingerprint of 512 bits. For this purpose, every path of 7 or fewer atoms in the molecule is located. From every path, an identifying text string that encodes atomic numbers and bond orders is constructed in a normalized way. From the text string a hash value is created, which is used to set the respective bit of the fingerprint. The PathFp descriptor is conceptually very similar to the 'folded fingerprints' that software of Daylight Inc. uses for calculating chemical similarities.
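The general technique of a hashed path fingerprint might look like the following Python sketch. It is a simplified illustration under several assumptions and does not reproduce the actual PathFp bits: the molecule representation (a list of atomic numbers plus a dictionary of bond orders), the string format, the normalization (taking the smaller of the forward and reverse strings) and the CRC32 hash are all choices made here, not taken from DataWarrior.

```python
import zlib

FP_BITS = 512   # fingerprint length stated in the manual
MAX_ATOMS = 7   # maximum path length stated in the manual


def path_string(path, atoms, bonds):
    """Encode atomic numbers and bond orders along a path as text."""
    parts = [str(atoms[path[0]])]
    for a, b in zip(path, path[1:]):
        parts.append(str(bonds[frozenset((a, b))]))
        parts.append(str(atoms[b]))
    return ".".join(parts)


def normalized(path, atoms, bonds):
    """Make the string independent of the direction the path was walked in."""
    return min(path_string(path, atoms, bonds),
               path_string(path[::-1], atoms, bonds))


def path_fingerprint(atoms, bonds):
    """Hash every linear path of up to MAX_ATOMS atoms into an integer bit set."""
    neighbours = {i: [] for i in range(len(atoms))}
    for bond in bonds:
        a, b = tuple(bond)
        neighbours[a].append(b)
        neighbours[b].append(a)

    bits = 0

    def extend(path):
        nonlocal bits
        key = normalized(path, atoms, bonds)
        bits |= 1 << (zlib.crc32(key.encode()) % FP_BITS)
        if len(path) < MAX_ATOMS:
            for nxt in neighbours[path[-1]]:
                if nxt not in path:  # paths must not revisit atoms
                    extend(path + [nxt])

    for start in range(len(atoms)):
        extend([start])
    return bits


# Ethanol heavy atoms: C-C-O, all single bonds.
ethanol_atoms = [6, 6, 8]
ethanol_bonds = {frozenset((0, 1)): 1, frozenset((1, 2)): 1}
ethanol_fp = path_fingerprint(ethanol_atoms, ethanol_bonds)
```

Because different path strings may hash to the same bit, a hashed fingerprint of this kind can produce collisions, which a dictionary-based fingerprint avoids.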

    The SphereFp Descriptor

    The SphereFp descriptor encodes circular spheres of atoms and bonds into a hashed binary fingerprint of 512 bits. From every atom in the molecule, DataWarrior constructs fragments of increasing size by including n layers of atom neighbours (n=1 to 5). These circular fragments are canonicalized considering aromaticity, but neglecting stereo configurations. From the canonical representation a hash code is generated, which is used to set the respective bit of the fingerprint. In the literature spherical fingerprints are sometimes referred to as HOSE codes and are in use for spectroscopy prediction.
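The layer-wise growth of circular fragments can be sketched as follows. This Python example is a strongly simplified illustration and not the SphereFp algorithm: instead of a true canonical fragment representation that considers aromaticity and bonding, it uses a crude stand-in (the sorted list of topological distance and atomic number of every atom in the sphere), and the CRC32 hash and the molecule representation are likewise assumptions.

```python
import zlib

FP_BITS = 512    # fingerprint length stated in the manual
MAX_LAYERS = 5   # n = 1 to 5 layers of neighbours


def sphere_layers(center, neighbours, max_layers=MAX_LAYERS):
    """Yield (n, distances) for spheres of n = 1..max_layers around center."""
    distance = {center: 0}
    frontier = [center]
    for n in range(1, max_layers + 1):
        next_frontier = []
        for atom in frontier:
            for nb in neighbours[atom]:
                if nb not in distance:
                    distance[nb] = n
                    next_frontier.append(nb)
        frontier = next_frontier
        yield n, dict(distance)


def sphere_fingerprint(atomic_numbers, bonds):
    """Hash simplified circular environments of every atom into a bit set."""
    neighbours = {i: [] for i in range(len(atomic_numbers))}
    for a, b in bonds:
        neighbours[a].append(b)
        neighbours[b].append(a)

    bits = 0
    for center in range(len(atomic_numbers)):
        for _n, distance in sphere_layers(center, neighbours):
            # Crude stand-in for canonicalization (an assumption of this sketch).
            key = repr(sorted((d, atomic_numbers[i]) for i, d in distance.items()))
            bits |= 1 << (zlib.crc32(key.encode()) % FP_BITS)
    return bits


# Ethanol heavy atoms: C-C-O.
ethanol_fp = sphere_fingerprint([6, 6, 8], [(0, 1), (1, 2)])
```

In a small molecule the sphere stops growing after a few layers, so the larger values of n simply set the same bit again.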

    The SkelSpheres Descriptor

    When a more subtle structural similarity is needed, the SkelSpheres (SkeletonSpheres) descriptor should be used. It is related to the SphereFp, but it also considers stereochemistry, counts duplicate fragments, additionally encodes hetero-atom depleted skeletons, and has twice the resolution, leading to fewer hash collisions. It is the most accurate descriptor for calculating similarities of chemical graphs. On the flip side, it needs more memory and similarity calculations take slightly longer. Technically, it is a byte vector with a resolution of 1024 bins.
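The benefit of counting fragments instead of only flagging them can be illustrated with a small Python sketch. The manual does not state which formula DataWarrior applies to byte vectors, so the generalized Tanimoto used here (sum of minima divided by sum of maxima) is an assumption, as are the function names and the 4-bin example vectors.

```python
def count_tanimoto(a, b):
    """Generalized Tanimoto on count vectors: sum of minima / sum of maxima."""
    shared = sum(min(x, y) for x, y in zip(a, b))
    total = sum(max(x, y) for x, y in zip(a, b))
    return shared / total if total else 1.0


def as_bits(counts):
    """Collapse a count vector into a binary fingerprint."""
    return [min(c, 1) for c in counts]


# Hypothetical 4-bin vectors; the second molecule has one fragment twice.
one_copy = [1, 0, 2, 0]
two_copies = [2, 0, 2, 0]

with_counts = count_tanimoto(one_copy, two_copies)
bits_only = count_tanimoto(as_bits(one_copy), as_bits(two_copies))
```

With counts the two molecules are 75% similar, whereas their binary fingerprints are indistinguishable.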

    The OrgFunctions Descriptor

    The OrgFunctions descriptor perceives molecules with a focus on the available functional groups from a synthetic chemist's point of view. It also recognizes the steric and electronic features of the neighborhood of the functional groups. It perceives molecules as being very similar if they carry the same functional groups in similar environments, independent of the rest of the carbon skeleton.

    The OrgFunctions descriptor is neither a fingerprint nor an integer vector. It rather stores all synthetically accessible functions of the molecule in a finely grained way. DataWarrior distinguishes 1024 core functions, which typically overlap. Butenone for instance is recognized as vinyl-alkyl-ketone as well as a carbonyl-activated terminal alkene. All 1024 functional groups are organized in a tree structure that permits deriving similarities between related functions. These are taken into account, when the similarity between two molecules, i.e. OrgFunctions descriptors, is calculated.

    The Flexophore Descriptor

    The Flexophore descriptor allows predicting 3D-pharmacophore similarities. It provides an easy-to-use and yet powerful way to check whether any two molecules may have a compatible protein binding behavior. A high Flexophore similarity indicates that a significant fraction of conformers of both molecules are similar concerning shape, size, flexibility and pharmacophore points. Unlike common 3D-pharmacophore approaches, this descriptor matches entire conformer sets rather than comparing individual conformers, which leads to higher predictability and takes molecular flexibility into account.

    The calculation of the Flexophore descriptor is computationally quite demanding. For a given molecule it starts with the creation of a representative set of up to 250 conformers, using a self-organization-based algorithm to construct small rigid molecule fragments, which are then connected with likely torsion angles. This conformer generation approach balances high diversity and conformer likelihood. Then, those atoms of the molecule that have the potential to interact with protein atoms in any way are detected and classified. De facto, an enhanced MM2 atom type is used to describe these atoms as interaction points. In some cases multiple atoms contribute to one summarized interaction point, e.g. in aromatic rings.

    Figure: Sample molecule with assigned interaction points

    A molecule's Flexophore descriptor now consists of a reduced, but complete graph of the original molecule with the interaction points being considered graph nodes. A graph edge between two nodes is encoded as a distance histogram between these nodes over all conformers. Since the Flexophore descriptor is a complete graph, every combination of any two nodes is encoded and stored as part of the descriptor. Thus, the descriptor creation as well as the similarity calculation from two descriptors depend heavily on the number of interaction points in each of them.

    Figure: Complete graph; distance histogram of highlighted edge
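How the edges of such a complete graph could be filled with distance histograms is sketched below in Python. The sketch assumes that the interaction points and their coordinates in every conformer are already known. The bin width, the number of bins, the function name and the example coordinates are hypothetical and not taken from DataWarrior.

```python
import math
from itertools import combinations

BIN_WIDTH = 1.0  # hypothetical bin width in Angstrom
N_BINS = 20      # hypothetical number of histogram bins


def distance_histograms(conformers):
    """Build one distance histogram per pair of interaction points.

    conformers: list of conformers, each a list of (x, y, z) tuples with one
    entry per interaction point, in the same order for every conformer.
    """
    n_points = len(conformers[0])
    histograms = {pair: [0] * N_BINS
                  for pair in combinations(range(n_points), 2)}
    for coords in conformers:
        for (i, j), hist in histograms.items():
            d = math.dist(coords[i], coords[j])
            # Distances beyond the last bin are collected in the last bin.
            hist[min(int(d / BIN_WIDTH), N_BINS - 1)] += 1
    return histograms


# Three interaction points in two hypothetical conformers.
conformers = [
    [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 4.2, 0.0)],
    [(0.0, 0.0, 0.0), (2.5, 0.0, 0.0), (0.0, 4.8, 0.0)],
]
edges = distance_histograms(conformers)
```

A descriptor with n interaction points has n(n-1)/2 edges, which is why the effort grows quickly with the number of interaction points.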

    The calculation of the similarity between two Flexophore descriptors involves a graph matching algorithm that not only tries to match the largest possible subgraphs, but also tries to maximize edge and node similarities. Edge similarities are derived from the distance histogram overlaps, and node similarities are taken from an interaction point (extended MM2 atom type) similarity matrix, which was originally derived from a ligand-protein interaction analysis of the Protein Data Bank (PDB).
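An edge similarity based on histogram overlap might be computed as in the following Python sketch. The manual does not specify the exact overlap measure, so the definition used here (the sum of bin-wise minima of the two normalized histograms) is an assumption, and the function name and example histograms are hypothetical.

```python
def histogram_overlap(hist_a, hist_b):
    """Overlap of two histograms after normalizing each to a total of 1."""
    total_a, total_b = sum(hist_a), sum(hist_b)
    if total_a == 0 or total_b == 0:
        return 0.0
    return sum(min(a / total_a, b / total_b)
               for a, b in zip(hist_a, hist_b))


# Distance histograms of one edge in two hypothetical descriptors.
edge_a = [0, 2, 2, 0]
edge_b = [0, 1, 3, 0]

edge_similarity = histogram_overlap(edge_a, edge_b)
```

With this definition identical distance distributions score 1.0 and distributions without any common bin score 0.0. The normalization makes the score independent of the number of conformers behind each histogram.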