DataWarrior User Manual

Working With Chemical Structures


DataWarrior was designed from the outset as a chemistry-aware data analysis and visualization platform. Its built-in chemistry intelligence allows working with chemical structures as easily as with alphanumerical data. Rows may be filtered based on whether molecules contain certain sub-structures or on various kinds of molecule similarities. Views may display molecular structures as labels or on the axes. Various kinds of molecule similarities can be translated into marker positions, sizes or colors. Data analysis methods such as principal component analysis or self-organizing maps can be applied to chemical structures just as well as to alphanumerical data.

Dedicated cheminformatics functionality provides state-of-the-art analysis methods, from the mere similarity comparison of two molecule lists to more advanced methods such as activity cliff analysis, SAR-table creation or the enumeration of virtual combinatorial substance libraries. 2-dimensional scaling methods help to visualize compound collections, and the prediction of many compound properties supports the characterization and selection of compounds.


The Structure Filter

If a DataWarrior file contains chemical structures, then the filter panel usually contains one or more Structure Filters. In any case one can always add new Structure Filters by selecting New Filter... from the Edit menu.

A Structure Filter option menu lets you switch between two general operation modes:
  • contains hides compounds (not) having a certain sub-structure
  • is similar to [...] hides rows with (dis-)similar structures. The descriptor used is given in brackets.

  • A double-click on the filter's structure field opens the Osiris Structure Editor. Please note that the behavior of the editor differs between the sub-structure search and the similarity search. In the former case you edit a potentially substituted chemical fragment that may contain query features while in the latter case you define a complete molecule.

    Structure editor in molecule mode.

    To edit a structure one typically selects a tool from the buttons on the left and uses the mouse to apply the tool in the right structure area. The keyboard can be used to accelerate the drawing process, e.g. typing a digit adds a chain of n atoms or changes a bond order depending on whether the mouse pointer is on top of an atom or a bond. '+' or '-' change an atom charge and Del removes selected atoms. Typing one or more letters changes an atom type. More options to modify an existing atom are available in the atom dialog, which opens after a double click on an atom. A single click just applies the defined options to the clicked atom.

    Atom dialog to define isotopes, abnormal valences and custom atom labels.

    Most of the structure editor's buttons are self-explanatory, while some need a little more explanation. For a detailed description of the editor please refer to the Structure Editor section.


    Substructure Search

    A filter in substructure search mode hides all rows whose structures do not contain the drawn molecule fragment. Required atom or bond features of the query fragment can be defined to narrow the search. A double click on an atom or bond after selecting the lasso tool opens a corresponding atom or bond query feature dialog. If multiple atoms or bonds have been selected, then the dialog's settings are applied to all selected atoms or bonds when the dialog is closed.


    Atom query feature dialog of structure editor in fragment mode.

    The first option in the atom query feature dialog lets you convert a defined atom into a wild-card. In the following field, allowed atoms, one may specify more precisely which atom types are allowed at this position. If any atomic number is selected and the atom is therefore declared a wild-card, then the allowed atoms field turns into an excluded atoms field. This causes the structure search to accept any atom but the explicitly excluded ones. While these options make the query fragment less specific, all following options narrow the search. Most of these options are self-explanatory.

    The option is part of an exclude group is a powerful feature that goes well beyond a standard substructure search. This option lets you mark a part of a query structure to represent an unwanted structural feature. This way you may search for amines (nitrogen atoms without an attached carbonyl group), for a single fluoro substituent that is not part of a di- or tri-fluoro substitution, or for pyridyl substituents without a methoxy in meta-position, etc. Exclude groups are also very useful in the context of generic reactions to build combinatorial libraries.

    Exclude groups are drawn in dark red with a pink background.

    If a query fragment contains an exclude group, then the substructure search process runs in two steps. First it locates all matches of the query fragment without the exclude group. If multiple matches are found, this can have two reasons: the query fragment may exist multiple times in a molecule, or the query fragment is symmetrical and matches the same atoms multiple times with different query-to-molecule atom assignments. The latter case we call equivalent matches. The second step of the substructure search then checks for every original match whether it can be extended to also include the exclude group. If the exclude group can be matched, then the original match and all equivalent matches are discarded. For the m-methoxy-pyridine example above this means that a molecule containing a pyridine fragment is not considered a match if either of the two meta positions carries a methoxy substituent.

    In the Structure Editor atom or bond query features are usually reflected in the drawing in an obvious way or indicated with small letters (s:further substituted, a:aromatic, h:hydrogen count, c:charge, n:neighbour count, r:ring size, etc.). Some complex query features may not be shown directly. In any case atoms and bonds with query features show a yellow spot in their background to indicate their special meaning.

    Query fragment with yellow markers to indicate invisible query features.

    In the bond query features dialog one may select more than one bond type, e.g. Single and Delocalized as in the example below. This would match the query bond to both single and delocalized bonds. It is important to understand that DataWarrior considers 5-membered aromatic rings not to be delocalized. Therefore, the Delocalized option does not match any bond of an aromatic 5-membered ring, unless it is annelated to an aromatic 6-membered ring, which is considered delocalized and which causes the shared bond between the two rings to be delocalized as well. In general DataWarrior considers aromatic bonds with no preferred mesomeric structure as being delocalized.

    Bond query feature dialog of structure editor in fragment mode.

    An Atom Bridge option lets you convert any bond of the query structure so that it represents a chain of multiple connected atoms rather than a direct connection. One defines the allowed number of atoms within the chain, not counting the atoms already drawn as part of the query structure.

    If Highlight Structure By -> Recent Filter is selected in the column header menu of the structure column in the Table View, then the query fragments as part of the entire molecule are drawn in dark red, while the rest of the displayed structures is usually drawn in black.


    Similarity search

    To focus on structurally similar compounds rather than on compounds sharing a specific sub-structure, select any of the is similar to [...] options. Then double-click the filter's structure area and draw a molecule using the Structure Editor or drag and drop a molecule from somewhere else to the filter. You may adjust the similarity level to your needs. Typically, chemists will perceive molecules as very similar if their similarity value is about 0.90 or above.

    If Highlight Structure By -> Recent Filter is selected in the column header menu of the structure column in the Table View, then structural elements that are shared between displayed structures and the query structure are highlighted in green.

    By default DataWarrior calculates a FragFp descriptor of the first structure column within the data table. This descriptor can be used to calculate similarities between molecules. The FragFp similarity between two molecules is the number of fragments that both molecules have in common divided by the number of fragments found in either of the two molecules. If your dataset contains other descriptors, then your filter menu contains associated is similar to [...] options that you may choose to filter by another similarity criterion. Other descriptors can be calculated by choosing Chemistry->Add Descriptor->... from the menu. Once the calculation has finished, the associated similarity option becomes available in the filter menu. Here you can find more information on descriptors and similarities and which kind of similarity should be used for which purpose.
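
    The similarity definition above (shared fragments divided by fragments found in either molecule) can be illustrated with a minimal Python sketch. The fragment keys used here are hypothetical placeholders, not actual FragFp dictionary entries, and treating two empty fingerprints as dissimilar is an assumption.

```python
def fragment_similarity(frags_a: set, frags_b: set) -> float:
    """Shared fragments divided by fragments found in either molecule."""
    union = frags_a | frags_b
    if not union:
        return 0.0  # assumption: two empty fingerprints count as dissimilar
    return len(frags_a & frags_b) / len(union)


# Hypothetical fragment keys, used for illustration only.
mol_a = {"c1ccccc1", "C=O", "CN", "CO"}
mol_b = {"c1ccccc1", "C=O", "CCl"}

print(fragment_similarity(mol_a, mol_b))  # 2 shared / 5 in total = 0.4
```

    With a filter limit of 0.90, the pair above would clearly be treated as dissimilar.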


    Search & Replace with Chemical Structures

    Usually, Search & Replace functionality is performed on text data. For this purpose the Search & Replace dialog (left) lets you select a column, a text string to be searched for and another piece of text as a replacement string. You may also define, whether the search target shall be replaced in all rows, in all visible rows, or in the selected rows only.

    Search & Replace dialogs for alphanumerical and structure columns.

    If the target column doesn't contain text but chemical structures, then the dialog changes and shows two substructure fields instead of text fields to define the search and replacement fragments (dialog on the right). For replacing one substructure by another, DataWarrior needs to know which attachment points on the search target match which atoms on the replacement fragment. This is defined by adding R-groups to both fragments. In order to replace a phenyl group with a cyclohexyl substituent, one needs to draw phenyl-R1 and cyclohexyl-R1. Atoms of the search fragment that don't carry an R-group are considered not to carry any more substituents. To define linkers or scaffolds instead of single bonded substituents, one needs to attach two or more R-groups to the search target and replacement fragment.


    Unifying 2D Atom Coordinates

    Typically, every column that contains chemical structures is accompanied by an invisible column with 2-dimensional atom coordinates, which allow DataWarrior to draw chemical structures in their original orientation. When no 2-dimensional atom coordinates exist, e.g. if an input file contains 3-dimensional coordinates only or if the structures were created from Smiles codes, then DataWarrior creates atom coordinates on the fly whenever a structure is displayed. If original atom coordinates are not satisfactory or if molecules with shared scaffolds shall always be drawn with the same scaffold orientation, then one should (re-)generate new atom coordinates for a given structure column. To create new atom coordinates select Generate 2D Atom Coordinates... from the Chemistry menu. A dialog opens where one can choose the structure column to be used.

    Options for 2D atom coordinate calculation and unification.

    If no further options are selected and OK is pressed, then DataWarrior tries to generate atom coordinates for every structure individually. If, however, as in the example above, some scaffolds are defined, then DataWarrior checks for every structure of the selected column whether it contains any of the given scaffolds. If a scaffold is found, then the corresponding atoms' coordinates are copied from the scaffold and the remaining atoms' coordinates are optimized around the scaffold without touching the scaffold's orientation.

    If Automatically detect scaffolds... is selected, then DataWarrior still considers manually defined scaffolds, if there are any. In addition it processes all structures, which don't contain manually defined scaffolds. From these structures it compiles a unique list of unsubstituted scaffolds. Then it generates atom coordinates for every one of these scaffolds. Afterwards it processes all structures again. Every structure that contains a scaffold of the list will receive new atom coordinates that contain the scaffold's optimized coordinates. This ensures that compounds based on the same scaffold are always drawn in the same orientation.
    The algorithm used for automatic scaffold location can be selected from among these:

    • Most central ring system: Imagine one removes all atoms and bonds that are not part of any ring of a given molecule. In this case we retain all separated and unsubstituted ring systems of this molecule. The most central ring system is the one that is topologically closest to the center of the molecule.
    • Murcko scaffold: If we locate all ring systems of a molecule and add all atoms and bonds that directly connect different ring systems, and remove all other bonds and atoms, we retain the so-called Murcko scaffold. Basically it is the original molecule with all substituents removed that do not contain any ring.
    Both options require the existence of at least one ring in the molecule. In molecules that don't contain rings no scaffold is found. Hence, newly created coordinates for these molecules won't be influenced by the list of found scaffolds.
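
    The Murcko scaffold description above can be sketched on a plain atom graph: repeatedly stripping terminal atoms leaves exactly the ring systems plus the chains that connect them. This is a simplified illustration under stated assumptions (a hypothetical adjacency-dictionary input, no bond orders, no special handling of exocyclic double bonds) and not DataWarrior's own code.

```python
def murcko_scaffold_atoms(adjacency: dict[int, set[int]]) -> set[int]:
    """Return the atoms that remain after repeatedly stripping terminal atoms."""
    graph = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    leaves = [atom for atom, nbrs in graph.items() if len(nbrs) <= 1]
    while leaves:
        atom = leaves.pop()
        if atom not in graph:
            continue  # already removed via an earlier queue entry
        for nbr in graph.pop(atom):
            graph[nbr].discard(atom)
            if len(graph[nbr]) <= 1:
                leaves.append(nbr)  # neighbor became terminal itself
    return set(graph)


# Hypothetical molecule: ring 0-1-2, linker atom 3, ring 4-5-6,
# plus a two-atom chain (7-8) attached to ring atom 0.
molecule = {
    0: {1, 2, 7}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5}, 7: {0, 8}, 8: {7},
}
print(sorted(murcko_scaffold_atoms(molecule)))  # [0, 1, 2, 3, 4, 5, 6]
```

    For a molecule without rings the function returns an empty set, which mirrors the statement that no scaffold is found in such molecules.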


    Comparing Structure Files

    Sometimes the need arises to compare two rather big sets of compounds for overlaps, i.e. for compounds in one set that are matched by similar compounds from the other set. One use case may be locating commercially available compounds that are dissimilar to any compound of an existing screening library. In a second step one might filter these compounds according to a desired property profile and then purchase a diverse subset of these compounds. Another example may be a virtual screening for those commercially available compounds that are similar to at least one of a known set of bioactive compounds. Both chemical and flexophore similarity may be useful in this case.

    This task compares all compounds of the currently active DataWarrior window with all compounds of a specified compound file. The open window will receive a new column containing the most similar structure of the other file and another new column containing the similarity value. In addition one may select more information from the external file to be included into this window. One may also create new files containing all similar structures or all dissimilar structures from the external file. As a last option one may write all compared compound pairs or just the similar ones into a new file.

    Options for locating (dis-)similar compounds between two files.

    File: Click Choose to select the structure file, which the active window's structures shall be compared to. This file may actually be the same file that was opened to show the active window.

    Descriptor: The descriptor from which to calculate similarity values. Only descriptors existing in your open windows are shown. If the external file does not contain the selected descriptor then it is calculated on the fly.

    Similarity limit: This defines the similarity threshold above which two compounds are considered similar. If the most similar compound is not above this limit, it will not be shown in the new columns of the active window. If a compound pair file is written, then this will also not include pairs, whose similarity is below this threshold.

    Select columns: This lets you select columns from the compared file, which shall be included as new columns into the current window and populated with the respective information of the most similar compound for each row.

    Save similar compounds to file: Select this to define a file name for storing all compounds of the external file that are found to be similar to any of the current window's compounds.

    Save dissimilar compounds to file: Select this to define a file name for storing all compounds of the external file that are not found to be similar to any of the current window's compounds.

    Save similar compound pairs to file: Select this to define a file name for storing all compound pairs that are more similar than the defined similarity limit.

    Compound-ID of this dataset: Here you may select a column from the active window containing compound identifiers, which will be written into the compound pair file.

    Compound-ID of external file: Here you may select a column from the external file containing compound identifiers, which will be written into the compound pair file.

    Create half matrix: This option is only available, if the active window's content was read from the same file that it is compared against. If this option is selected, then only half of the similarity matrix is processed, because this already covers all possible compound pair combinations.
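
    The comparison logic behind these options can be outlined in a short Python sketch. It assumes binary descriptors represented as sets of on-bits and a Tanimoto similarity; the function names, the sample data and the choice to treat a similarity equal to the limit as similar are assumptions made for illustration.

```python
from itertools import combinations


def tanimoto(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(query_fp, external, limit):
    """Best match in the external set, or (None, None) if below the limit."""
    best_id, best_sim = None, 0.0
    for ext_id, fp in external.items():
        sim = tanimoto(query_fp, fp)
        if sim > best_sim:
            best_id, best_sim = ext_id, sim
    if best_id is None or best_sim < limit:  # assumption: equality counts as similar
        return None, None
    return best_id, best_sim


def similar_pairs_half_matrix(fps, limit):
    """Compare a set with itself; each unordered pair is visited only once."""
    return [(a, b, tanimoto(fps[a], fps[b]))
            for a, b in combinations(sorted(fps), 2)
            if tanimoto(fps[a], fps[b]) >= limit]


active = {"A1": {1, 2, 3, 4}, "A2": {7, 8}}
external = {"E1": {1, 2, 3}, "E2": {3, 4, 9}, "E3": {10, 11}}
print(most_similar(active["A1"], external, 0.7))  # ('E1', 0.75)
print(most_similar(active["A2"], external, 0.7))  # (None, None)

library = {"a": {1, 2}, "b": {1, 2, 3}, "c": {9}}
pairs = similar_pairs_half_matrix(library, 0.6)
print([(a, b) for a, b, _ in pairs])  # [('a', 'b')]
```

    For n compounds the half matrix covers n(n-1)/2 pairs instead of n*n, which is why it suffices when a file is compared against itself.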


    Selecting Diverse Compounds from Large Set

    This function is an efficient implementation for locating a most diverse subset within a given set of molecules. The algorithm can be preloaded with a second set of molecules, causing it to select molecules that are both most different from any molecule in the second set and highly diverse among themselves. Especially for this reason, this function is well suited to selecting diverse screening compounds from a provider's catalog while avoiding any compound that is similar to already available in-house compounds.

    Dialog configured to select 50000 diverse compounds different from currentLibrary.dwar.

    All binary descriptors can be used with this algorithm. After computing the desired number of diverse compounds, a column is added to the dataset with ascending numbers indicating selected compounds. The compound with number 1 is the compound that is most different from all the others. Compound number 2 is most different from number 1. Compound 3 is the one most different from 1 and 2, and so forth. If a dataset contains a few awkward compounds, then these are likely to be picked first. Therefore, in practice one would often skip the very first compounds of the diverse selection.
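
    The picking order described above resembles a greedy max-min selection, which the following Python sketch illustrates. It is not DataWarrior's optimized implementation; binary descriptors are modeled as sets of on-bits, and choosing the first compound by lowest summed similarity to all others is an assumption about what "most different to all the others" means.

```python
def tanimoto(a: set, b: set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pick_diverse(candidates, count, avoid=()):
    """Return candidate ids in picking order (rank 1 first)."""
    remaining = dict(candidates)
    if avoid:
        # Preloaded set: start from each candidate's closest preloaded compound.
        score = {cid: max(tanimoto(fp, ref) for ref in avoid)
                 for cid, fp in remaining.items()}
    else:
        # Assumption: rank 1 is the compound with the lowest summed similarity.
        score = {cid: sum(tanimoto(fp, other)
                          for oid, other in remaining.items() if oid != cid)
                 for cid, fp in remaining.items()}
    picked = []
    while remaining and len(picked) < count:
        best = min(remaining, key=score.get)
        best_fp = remaining.pop(best)
        first_unseeded_pick = not picked and not avoid
        picked.append(best)
        for cid, fp in remaining.items():
            sim = tanimoto(fp, best_fp)
            # Track the highest similarity to anything picked (or preloaded) so far.
            score[cid] = sim if first_unseeded_pick else max(score[cid], sim)
    return picked


candidates = {"a": {1, 2, 3}, "b": {1, 2, 4}, "c": {7, 8, 9}, "d": {1, 2, 3, 4}}
print(pick_diverse(candidates, 3))                      # ['c', 'a', 'b']
print(pick_diverse(candidates, 2, avoid=[{7, 8, 9}]))   # ['a', 'b']
```

    In the first call the outlier "c" is picked first, which illustrates why awkward compounds tend to lead the selection.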


    Clustering Compounds

    Clustering is an old cheminformatics technique for subdividing a typically large compound collection into small groups of similar compounds. Clustering was used in the old days, when computational resources were expensive, to precompute similarity relationships between compounds. Cluster membership could be stored easily in databases to be quickly retrieved later, whenever the need arose to locate similar structures to any given structure, e.g. after a high-throughput screening. The inherent problem of clustering is that cluster borders are arbitrary and may separate very similar compounds into different clusters. Therefore, the retrieval of all cluster co-members of a given compound does not necessarily result in the most similar compounds.

    The cluster algorithm implemented in DataWarrior is simple and reproducible, but computationally demanding and, therefore, best used if the dataset doesn't contain much more than 10000 compounds. First the complete similarity matrix is calculated, which can be done with any descriptor. Then, in a stepwise process the most similar compounds or clusters are merged to form a new cluster, whose similarity to the remaining compounds and clusters is re-calculated as a weighted mean from its members. The merging process continues until a stop criterion is met. Stop criteria can be defined in the cluster dialog.

    The clustering process may be defined to stop when the cluster count reaches a desired number or when the similarity needed to join two clusters falls below a definable limit. If both criteria are defined, then the clustering stops as soon as either criterion is met.
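
    The merging process and both stop criteria can be sketched in Python as follows. This is an illustrative sketch, not DataWarrior's code: it assumes a precomputed symmetric similarity matrix, weights each merged cluster by its member count, and uses hypothetical parameter names.

```python
def cluster_compounds(sim, max_clusters=None, min_similarity=None):
    """Merge the most similar clusters until a stop criterion is met."""
    clusters = {i: [i] for i in range(len(sim))}
    link = {(i, j): sim[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(sim)
    while len(clusters) > 1:
        if max_clusters is not None and len(clusters) <= max_clusters:
            break  # desired cluster count reached
        (a, b), best = max(link.items(), key=lambda item: item[1])
        if min_similarity is not None and best < min_similarity:
            break  # joining similarity fell below the limit
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        # Keep links that do not involve the two merged clusters.
        new_link = {pair: s for pair, s in link.items()
                    if a not in pair and b not in pair}
        for k in clusters:
            s_a = link[(min(a, k), max(a, k))]
            s_b = link[(min(b, k), max(b, k))]
            # Size-weighted mean of the two former similarities.
            new_link[(k, next_id)] = (size_a * s_a + size_b * s_b) / (size_a + size_b)
        clusters[next_id] = merged
        link = new_link
        next_id += 1
    return sorted(sorted(members) for members in clusters.values())


sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
]
print(cluster_compounds(sim, min_similarity=0.5))  # [[0, 1], [2, 3]]
print(cluster_compounds(sim, max_clusters=3))      # [[0, 1], [2], [3]]
```

    Because the full similarity matrix is held in memory and scanned at every merge, the cost grows quickly with dataset size, consistent with the size recommendation above.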


    Calculating Molecular Properties

    DataWarrior is able to calculate certain physico-chemical properties, lead- or drug-likeness related parameters, ligand efficiencies, various atom and ring counts, molecular shape, flexibility and complexity as well as indications for potential toxicity. After calculating properties, these are automatically added as new columns to the data table.

    To calculate any molecular properties from chemical structures select Add Compound Properties... from the Chemistry menu. Select the properties of interest from one or more property sections and click OK.

    Properties related to the ligand efficiency are based on IC50 values and require the selection of a corresponding numerical column that contains IC50-values.
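
    As an illustration of how an IC50 value enters such a property, the sketch below uses one commonly cited definition of ligand efficiency: the binding free energy approximated as -RT ln(IC50), divided by the number of non-hydrogen atoms. Whether DataWarrior uses exactly this formula, temperature or unit is not stated here; the nanomolar input unit and all names are assumptions.

```python
import math

GAS_CONSTANT = 0.0019872  # kcal/(mol*K)


def ligand_efficiency(ic50_nanomolar, heavy_atoms, temperature=298.15):
    """Approximate free energy of binding per heavy atom in kcal/mol."""
    ic50_molar = ic50_nanomolar * 1e-9  # assumption: input column is in nM
    return -GAS_CONSTANT * temperature * math.log(ic50_molar) / heavy_atoms


le = ligand_efficiency(10.0, 25)  # 10 nM inhibitor with 25 heavy atoms
print(round(le, 2))  # about 0.44
```

    A more potent compound of the same size yields a higher value, and so does an equally potent but smaller compound.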

    Some properties match those available in the OSIRIS Property Explorer, which was made public in 2000 and is now maintained on www.openmolecules.org. These properties and the algorithms used are explained in more detail here.


    Enumerating Virtual Libraries

    DataWarrior can generate all structures of a virtual combinatorial library if a generic reaction is drawn and for every generic reactant a list of real reactant structures is provided. The enumerated product structures could be used to predict physico-chemical properties and to select those products with the most promising properties for synthesis or to be purchased.

    To create virtual libraries, select Create Combinatorial Library... from the Chemistry menu. Then draw or Open a generic reaction schema in the following dialog:

    • Map each atom involved in the reaction using the mapping tool.
    • You may save your generic reaction for later re-use.
    • Click OK to switch from the reaction to the reactant dialog.

    • Define all reactants by editing them, drag&drop, copy/paste or loading them from a file.
    • Click OK to start the enumeration.

    When creating the product structures, DataWarrior retains the atom coordinates of the generic product. Therefore all products are later shown in the expected orientation. After all product structures have been created, DataWarrior creates some default views. Now you may continue by calculating physico-chemical properties for all virtual products, clustering them or running some other kind of analysis.
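
    The combinatorial nature of the enumeration can be illustrated with a small Python sketch: every product corresponds to one combination of real reactants, one taken from each reactant list. The sketch only enumerates the combinations; building the actual product structure from the mapped generic reaction is done by DataWarrior and is not reproduced here. The reactant names are arbitrary examples.

```python
from itertools import product


def enumerate_combinations(reactant_lists):
    """One tuple per virtual product: one reactant from every list."""
    return list(product(*reactant_lists))


acids = ["acetic acid", "benzoic acid"]
amines = ["methylamine", "aniline", "piperidine"]

library = enumerate_combinations([acids, amines])
print(len(library))  # 2 * 3 = 6
print(library[0])    # ('acetic acid', 'methylamine')
```

    The library size is the product of the reactant list lengths, so it grows rapidly with every additional generic reactant.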


    Generating Evolutionary Libraries

    Chemical space is huge. There are estimates that the number of distinct stable molecular structures with a molecular weight in the drug-like range is about 10^60. It will never be possible to compute all these structures to search them for the one with the most promising property profile for a particular purpose. Nevertheless, navigating this vast space and locating unknown promising compounds can give new ideas.

    In cheminformatics sometimes a de-novo-design approach is taken to design new structures from scratch with a high likelihood of being active on a chosen target. Often this starts with a small fragment, to which atoms or small fragments are added satisfying a ligand or protein based fitness criterion.

    DataWarrior uses an evolutionary approach mimicking nature by randomly mutating existing molecular structures with tiny changes to create new generations of potentially better structures. Every generation of molecules is checked for fitness by a set of customizable criteria, and the most promising structures survive and serve as starting points for the next generation. The mutation algorithm performs changes like single atom replacements, atom insertions, bond order changes, substituent migrations, ring aromatisations, etc. For any structure about to be mutated, all possible mutations are evaluated regarding how much the change increases or decreases the drug-likeness (or optionally natural-product-likeness). Mutations with a change in the desired direction are assigned a higher probability than mutations that decrease drug- or natural-product-likeness. Mutations that would create high ring strain are removed from the list.
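
    The generation loop described above can be outlined with a toy Python sketch. Structures are stood in for by integers, and the mutation, likeness and fitness functions as well as the 2:1 probability weighting are hypothetical placeholders; only the overall scheme (weighted random mutation, fitness ranking, survivors seeding the next generation) follows the text.

```python
import random


def evolve(start, fitness, mutations, likeness,
           generations=20, offspring=8, survivors=2, seed=0):
    """Return the fittest individual seen over all generations."""
    rng = random.Random(seed)
    parents = [start]
    best = start
    for _ in range(generations):
        children = []
        for parent in parents:
            options = mutations(parent)
            # Favor mutations that raise the likeness score (weights are an assumption).
            weights = [2.0 if likeness(m) > likeness(parent) else 1.0
                       for m in options]
            children += rng.choices(options, weights=weights, k=offspring)
        children.sort(key=fitness, reverse=True)
        parents = children[:survivors]  # the most promising structures survive
        best = max([best] + parents, key=fitness)
    return best


# Toy stand-ins: a "structure" is an integer, a mutation changes it by one.
def mutations(x):
    return [x - 1, x + 1]


def likeness(x):
    return -abs(x - 50)  # hypothetical likeness score


def fitness(x):
    return -abs(x - 10)  # hypothetical property target


best = evolve(0, fitness, mutations, likeness)
print(best, fitness(best))
```

    Keeping track of the overall best individual separately from the current parents corresponds to the overall best ranking molecule that is retained across generations.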

    In the fitness panel of the Evolutionary Library Dialog a desired compound property portfolio can be defined. For simple properties one may define an optimal numerical range. One may also require compounds to be similar or dissimilar to a definable set of compounds using any descriptor. All individual fitness criteria can be weighted to make them more or less important than others.

    In the fitness example above we look for compounds, whose chemical structure is dissimilar to any of three known inhibitors, while at the same time being similar to at least one of these inhibitors considering flexophore similarity. In other words, we are looking for compounds with a similar target binding behaviour, but with a dissimilar chemical structure to the known inhibitors.

    The picture above shows the General panel of the dialog after starting the evolutionary process. As starting point the structure of LDS has been selected, which is as good as any other starting point. The type of compounds being created is set to natural products. We see the structure of the currently mutated parent molecule, the molecules in the current generation and the overall best ranking molecule. The background color of these molecules reflects how well the fitness portfolio is already met. At any time during the evolution process one may click Stop to create a new document with the fittest structures of all generations.


    Analysing Scaffolds

    The Scaffold Analysis locates the core structure(s) of every molecule within a given column and creates a new column that contains these scaffolds. The method used to locate the core structure(s) depends on the chosen Scaffold type:

    • Plain ring systems: This mode locates all single ring and annelated ring systems without any substituents.
    • Ring systems with substitution pattern: This mode works like the previous one, but marks as substituted every ring atom that carries an exocyclic, non-hydrogen substituent in the original molecule.
    • Ring systems with carbon/hetero subst. pattern: This mode goes a step further by distinguishing, whether a substituent's first atom was a carbon atom or a hetero atom.
    • Ring system with atomic-no subst. pattern: This mode is even more specific. Every exocyclic substituent is represented by its first atom.
    • Murcko scaffold: The Murcko scaffold contains all plain ring systems of the given molecule plus all direct connections between them. Substituents that don't contain ring systems are removed from rings and ring connecting chains.
    • Murcko skeleton: The Murcko skeleton is a generalized Murcko scaffold, which has all hetero atoms replaced by carbon atoms.
    • Most central ring system: As the name implies, this is that ring system of the molecule, which is closest to its topological center. It does not contain any exocyclic substitution information.

    The image below illustrates the different scaffold modes. The original molecule structure is shown at the top middle position. The scaffold structures produced by each of the seven scaffold modes are shown around the original molecule.

    If the Save scaffold frequency file option is selected, then DataWarrior creates a new document listing all detected scaffolds and their occurrence frequency. The name and location of the scaffold file can be set after pressing the Choose button.


    Creating SAR-Tables

    Structure-Activity Relationship (SAR) Tables are frequently used to correlate biological properties with the substitution patterns of compound sets that share one or a few chemical scaffolds and often have been synthesised in a combinatorial fashion. If a dataset contains chemical structures, DataWarrior may decompose the structures by analysing scaffold and substituents and putting them into new dedicated columns. This can be done either fully automatically or, a little more flexibly, with some user guidance. Functionality of this kind is sometimes called R-group-decomposition or R-group-deconvolution.

    To automatically create a SAR-Table from your dataset, select Automatic SAR-Analysis... from the Chemistry menu. A dialog lets you choose the mechanism that is used to determine the scaffold.

    • Most central ring system: As the name implies, with this option that ring system of the molecule, which is closest to the topological center of the molecule, is taken as the scaffold.
    • Murcko scaffold: The Murcko scaffold of a molecule is determined by locating all ring systems of the molecule and all direct connections between them. Everything else is considered as substituents.
    Compounds without any rings are not subjected to the analysis and their cells in the new columns remain empty.

    If you need more flexibility in determining, which sub-structures should be considered the central scaffold, then you should select Core based SAR-Analysis... from the Chemistry menu. The dialog lets you define a sub-structure that may include query features. By using atom wild cards or variable bond bridges, one drawn sub-structure may detect multiple different scaffolds at once. Nevertheless, often the Core based SAR-Analysis... function needs to be used multiple times to process all scaffolds in the dataset.

    Whatever mode you use to generate a SAR-Table, for every different scaffold the entire dataset is processed to find those scaffold atoms whose substituents vary across the dataset. To each of these, a generic R-group is attached. If a scaffold atom always carries the same substituent, this substituent is attached to the scaffold. The scaffold with attached R-groups and substituents is called the core-fragment and is put into the first new table column. New R-group columns are added as needed.
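
    The rule for deciding between an R-group column and a fixed substituent can be sketched in Python. The sketch assumes that scaffold matching has already assigned each substituent to a numbered scaffold position; the input layout, the function name and the use of "H" for an unsubstituted position are hypothetical.

```python
def build_sar_table(substituents_by_molecule):
    """Split scaffold positions into constant substituents and R-group columns."""
    positions = sorted({pos for subs in substituents_by_molecule.values()
                        for pos in subs})
    core, r_columns = {}, {}
    for pos in positions:
        seen = {subs.get(pos, "H") for subs in substituents_by_molecule.values()}
        if len(seen) == 1:
            core[pos] = seen.pop()  # constant substituent stays on the core
        else:
            label = f"R{len(r_columns) + 1}"
            core[pos] = label       # varying position becomes a generic R-group
            r_columns[label] = pos
    rows = {mol: {label: subs.get(pos, "H") for label, pos in r_columns.items()}
            for mol, subs in substituents_by_molecule.items()}
    return core, rows


molecules = {
    "m1": {1: "Cl", 3: "OMe", 4: "Me"},
    "m2": {1: "F", 3: "OMe", 4: "Me"},
    "m3": {1: "Cl", 3: "OMe"},
}
core, rows = build_sar_table(molecules)
print(core)        # {1: 'R1', 3: 'OMe', 4: 'R2'}
print(rows["m3"])  # {'R1': 'Cl', 'R2': 'H'}
```

    Position 3 always carries the same substituent and therefore becomes part of the core-fragment, while positions 1 and 4 each get their own R-group column.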

    Table view with new columns added after SAR-Analysis


    Similarity Analysis

    In the recent literature (1-3), the terms Molecular Similarity Analysis, Activity Cliff Analysis and Activity Landscape are hot topics. All these related methods have in common that they usually start with a 2-dimensional scaling process of the chemical space, which means that all involved molecules are positioned somehow on a 2D-area, such that similar molecules are located close to each other. This scaling could be done by running a principal component analysis (PCA) on a descriptor of the molecules and using the first two components as coordinates. Another approach would be a self organizing map (SOM) from a descriptor. Both of these options are limited in terms of the descriptor type, because they require input data to be a vector, i.e. a binary or numerical array of data.

    While DataWarrior allows running PCAs or SOMs on descriptor vectors and visualizing the results as a chemical landscape, the Similarity Analysis is based on a different method. It uses a Rubberbanding Forcefield approach, which translates similarity better than a PCA, is faster than a SOM, uses the available space more efficiently and works with any type of similarity criterion including the Flexophore descriptor.

    The approach involves the following steps:

    • randomly position all molecules on the 2D space
    • calculate the entire similarity matrix between all molecules
    • locate most similar neighbors to be considered for every molecule
    • between any two neighbors assume attractive forces, which increase with similarity and distance
    • stepwise relocate all molecules parallel to the mean vector of perceived forces
    • while attractive forces decrease over time and due to lower distances, introduce increasing short range repelling forces among all molecules

    Three default views after similarity analysis

    When DataWarrior has finished the calculation of molecule positions, it creates three new default views:

    • A view depicting the chemical space of all molecules. Similar neighbors are connected with a connecting line and the markers that represent the molecules are colored dynamically by molecule similarity to the chosen Current Molecule, which changes whenever you click another marker.
    • A tree view that shows the direct neighbors of the chosen Current Molecule. When a marker or molecule is clicked on in any view, the Current Molecule changes and the tree view's content is dynamically updated to show the neighborhood of the new molecule.
    • A structure view, which is configured to show selected molecules on top, while the non-selected ones are grayed out. The highlight mode of the respective structures column is set to Current Row Similarity, causing any displayed molecule to show any structural differences to the molecule of the Current Row. Structural elements possessed by the reference molecule, which are not part of the depicted molecule, are shown in red. Structural elements of the shown molecule, which are not present in the reference molecule, are highlighted with a blue background. To change the selection of displayed molecules, you simply need to select different markers in the tree view or on the similarity map.

    Since a Similarity Analysis is closely related to an Activity Cliff Analysis, more information about how to configure and run a similarity analysis can be found at the end of the next section.

    1) Peltason L, Bajorath J; Molecular similarity analysis uncovers heterogeneous structure-activity relationships and variable activity landscapes; Chem. Biol., 2007, 14 (5), pp 489-497
    2) Guha R, Van Drie J H; Structure-Activity Landscape Index: Identifying and Quantifying Activity Cliffs; J. Chem. Inf. Model., 2008, 48 (3), pp 646-658; DOI: 10.1021/ci7004093
    3) Bajorath J, Peltason L, Wawer M, Guha R, Lajiness M S, Van Drie J H; Navigating structure-activity landscapes; Drug Discovery Today, 2009, 14 (13-14), pp 698-705


    Activity Cliff Analysis

    The Activity Cliff Analysis uses the mechanism explained in the previous section to create a similarity map of all involved molecules. It also detects all similarity relationships between them that lie above an automatically determined similarity threshold. To be precise, this is not a global cutoff value, but one that is modulated from molecule to molecule. Depending on the neighborhood situation of an individual molecule, the threshold may be increased or decreased to account for many very similar neighbors or for few, only moderately similar neighbors. This reduces singletons and untangles large clusters to some extent.

    In addition to the Similarity Analysis, the so-called Structure-Activity Landscape Index (SALI) is calculated for all pairs of similar molecules. If two molecules have measured activities a1 and a2 and a structural similarity s, then the SALI value of this pair is defined as SALI = |a1-a2| / (1-s). The SALI value is a measure of how much activity is gained (or lost) with a relatively small change in structure. Molecule pairs that show an abrupt change in activity despite having rather similar structures are called activity cliffs. These pairs are particularly interesting if one tries to understand structure-activity relationships in order to design new structural motifs with improved activities.
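
    As a small worked illustration of the SALI formula, the following Python sketch computes the value for two hypothetical molecule pairs. The function name sali and the example numbers are made up for this illustration, and the treatment of identical structures (s = 1) is an assumption, since the formula itself is undefined in that case.

```python
import math


def sali(a1, a2, s):
    """Structure-Activity Landscape Index: |a1 - a2| / (1 - s)."""
    if s >= 1.0:
        # Assumption: for identical structures report infinity if the
        # activities differ and zero if they are equal.
        return math.inf if a1 != a2 else 0.0
    return abs(a1 - a2) / (1.0 - s)


# Hypothetical activities (e.g. pEC50 values) and similarities.
cliff = sali(8.0, 5.0, 0.875)   # very similar, large activity gap
smooth = sali(8.0, 7.5, 0.875)  # very similar, small activity gap
print(cliff, smooth)
```

    With equal similarity, the pair with the larger activity difference receives the larger SALI value. With equal activity difference, the value grows as the similarity approaches 1.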

    After an Activity Cliff Analysis the generated similarity view encodes SALI values and activities in marker size and marker color, respectively. The image above shows a part of such a similarity map. In this case the dataset contained EC50 values on Cannabinoid CB1 and CB2 receptors. The marker background color reflects the receptor subtypes (CB1: pink, CB2: orange). One can easily recognize clusters of similar compounds, locate active compounds (red markers), locate activity cliffs (large markers), and even distinguish CB1 from CB2 inhibitors.


    Configuring And Running A Similarity Or Activity Cliff Analysis

    To perform a Similarity or Activity Cliff Analysis choose Analyse Similarity/Activity Cliffs... from the Chemistry menu. The following dialog appears:

    Similarity Analysis Dialog

    Similarity on: Defines the similarity criterion, i.e. the descriptor that is used for arranging molecules on the 2D-map. One may use any descriptor that DataWarrior knows of, provided that it has been calculated previously for the current data file. The most useful descriptors are SkelSpheres for fine-grained chemical graph similarity, OrgFunctions for similarity of synthetically relevant organic functionality, and Flexophore to create a molecule map based on the similarity of protein binding characteristics.

    Activity column: For a Similarity Analysis don't select a column here. For an Activity Cliff Analysis you need to select the column that contains the numerical values to calculate SALI values from. For any pair of molecules the SALI value reflects how much activity is gained with a small change of the chemical structure. Very high SALI values identify activity cliffs, i.e. those rare points in an activity landscape where a small change of the chemical structure causes a large change in activity (or in any other experimentally determined molecule property). Identifying these molecule pairs and understanding the structural cause of the activity change can be very helpful in the process of designing compounds with better properties.

    Identifier column: The Similarity or Activity Cliff Analysis detects for every molecule its most similar neighbor molecules and writes a reference to those molecules into a new column. Therefore it needs a column that contains a key that uniquely identifies a molecule or data row. If your data contains compound identifiers, you may select that column. Otherwise, DataWarrior will create a new number for that purpose.

    Separate groups by: In some cases one column contains experimental data referring to multiple targets or measured under different conditions. If a second column contains categories describing the conditions, and if only values within the same category can be compared to each other, then you should select the category column here. SALI values will then only be calculated from compatible experimental values.
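
    The effect of separating groups can be illustrated with a short Python sketch. It is not DataWarrior code: the function name comparable_pairs, the row layout with the keys "id", "group" and "activity", and the fixed similarity limit are hypothetical choices for this example. Pairs of identical structures (similarity 1) are skipped here to avoid a division by zero, which is also an assumption.

```python
from itertools import combinations


def comparable_pairs(rows, similarity, limit):
    """Return (id_a, id_b, sali) for similar pairs within the same category.

    rows:       list of dicts with the hypothetical keys "id", "group", "activity"
    similarity: dict mapping frozenset({id_a, id_b}) to a similarity in 0..1
    limit:      minimum similarity for a pair to count as neighbors
    """
    results = []
    for a, b in combinations(rows, 2):
        s = similarity[frozenset((a["id"], b["id"]))]
        if s < limit or s >= 1.0:
            continue  # not similar enough, or identical structures (assumption)
        if a["group"] != b["group"]:
            continue  # activities from different categories are not comparable
        sali = abs(a["activity"] - b["activity"]) / (1.0 - s)
        results.append((a["id"], b["id"], sali))
    return results


# Made-up example data: two CB1 measurements and one CB2 measurement.
rows = [
    {"id": "M1", "group": "CB1", "activity": 8.0},
    {"id": "M2", "group": "CB1", "activity": 5.0},
    {"id": "M3", "group": "CB2", "activity": 6.0},
]
similarity = {
    frozenset(("M1", "M2")): 0.875,
    frozenset(("M1", "M3")): 0.875,
    frozenset(("M2", "M3")): 0.5,
}
pairs = comparable_pairs(rows, similarity, limit=0.8)
print(pairs)
```

    In this example only the pair M1/M2 yields a SALI value. M1 and M3 are equally similar, but their activities belong to different categories, and M2/M3 falls below the similarity limit.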

    Similarity limit: Usually Automatic does a good job. However, if you prefer getting more or fewer neighbor relationships than the automatic process generates, you may disable the automatic setting and (moderately) adjust the slider that defines the threshold. If the limit is set too high, the 2D-scaling may find too few similarity relationships. The final map may then not be much different from the initial state of randomly scattered molecules. If the limit is set too low and therefore too many similarity pairs are found, then a highly interconnected bunch of molecules won't equilibrate well.

    Create view based on similarity relationships: If this option is checked, a similarity map of all molecules is created. For this purpose a Rubberbanding Forcefield is employed to incrementally equilibrate the 2D-coordinates of all molecules until an energy minimum is reached and all molecules are positioned close to their most similar neighbors. Afterwards a 2D-view is created to visualize the similarity map.

    Create document of structure pairs: If this option is checked, then DataWarrior creates a new document in an open window, which contains all detected similarity relationships in dedicated rows. Two columns contain the two neighbor molecules; additional columns contain molecule identifiers, similarity, activities, and SALI values.

