DataWarrior User Manual

Working With Chemical Structures And Reactions


DataWarrior was designed from the outset as a chemistry-aware data analysis and visualization platform. Its built-in chemistry intelligence allows working with chemical structures as easily as with alphanumerical data. Rows may be filtered based on whether molecules contain certain sub-structures or on various kinds of molecule similarities. Views may display molecular structures as labels or on the axes. Various kinds of molecule similarities can be translated into marker positions, sizes or colors. Data analysis methods like principal component analysis or self-organizing maps can be applied to chemical structures just as well as to alphanumerical data.

Dedicated cheminformatics functionality provides state-of-the-art analysis methods, from the mere similarity comparison of two molecule lists to more advanced methods such as activity cliff analysis, SAR-table creation, or the enumeration of virtual combinatorial substance libraries. 2-dimensional scaling methods help to visualize compound collections, and the prediction of compound properties supports the characterization and selection of compounds.


Chemical Structure Fundamentals

Internally, DataWarrior represents chemical structures as so-called ID-Codes, which are very compact text strings that encode chemical structures in a canonical form. This means that no matter in which order the atoms and bonds have been drawn, if two structures represent the same molecule, then their ID-Codes are also identical. ID-Codes contain aromaticity information and all stereo features. DataWarrior fully supports the concept of Enhanced Stereochemical Representation (ESR), which not only considers basic stereo features (configurations of stereo centers and double bonds), but also the relationships between them. This allows you to define mixtures of enantiomers, diastereomers, or epimers when drawing the structure. The ESR information is encoded in a canonical form into the ID-Codes, and DataWarrior uses it whenever appropriate, e.g. when displaying or editing chemical structures, during substructure search, or when generating conformers.

ID-Codes may also represent substructures or query fragments as opposed to normal molecules. Then they may contain atom and bond related query features. Reactions can be represented by ID-Codes as well.

Because atom coordinates cannot be part of a canonical representation, ID-Codes don't include them. Instead, DataWarrior usually stores atom coordinates in a separate, invisible child column associated with the ID-Code column. If atom coordinates are 3D, then that atom coordinate column contains conformers.

Much of what you can do with molecular structures in DataWarrior, especially if it is related to structure similarity, requires the presence of descriptors derived from the chemical structure itself. The kinds of descriptors that can be generated and used by DataWarrior are explained in more detail in a section called Similarity & Descriptors. Like atom coordinates, descriptors are stored in invisible child columns that are associated with a parent ID-Code column. To calculate a new descriptor for a given chemical structure column, select its name from the Chemistry->From Chemical Structure->Calculate Descriptor menu.

The Chemistry->From Chemical Structure menu contains various additional items, which allow adding a new column with data derived from the chemical structure:

Add Molecular Formula...: Adds a new column containing the molecular formula of the chosen structure column.

Add SMILES Code...: SMILES codes (simplified molecular-input line-entry system) are a widely used text encoding of chemical structures that can be interpreted by humans. SMILES were originally developed by David Weininger and used by Daylight Chemical Information Systems. Basic SMILES neither encoded stereo features nor were they canonical. Extended versions (isomeric, absolute, and unique SMILES) solved these issues. However, it should be noted that SMILES codes cover only basic stereo features; the Enhanced Stereochemical Representation (ESR) concept is not part of them. Canonical SMILES from different vendors' software packages are not necessarily identical, even if they encode the same molecule. There is no standard that would define the rules for producing canonical SMILES, but the OpenSMILES Specification defines the syntax for isomeric SMILES, which contain the molecular graph including aromaticity and basic stereo features. DataWarrior generates absolute SMILES, i.e. canonical SMILES with stereo features. Thus, keep in mind that you lose all ESR information when generating SMILES and using them for other purposes.

Add Standard Inchi...: InChI is an International Chemical Identifier developed under the auspices of IUPAC, the International Union of Pure and Applied Chemistry [1], with principal contributions from NIST (the U.S. National Institute of Standards and Technology [2]) and the InChI Trust. DataWarrior's InChI support is based on the JNI-InChI project, which implemented InChI version 1.03. Like SMILES codes, InChI does not support Enhanced Stereochemical Representation (ESR).

Add Inchi Key...: The InChI key is a hash code of the Standard InChI. It is therefore more compact and still uniquely derived from the structure, but it cannot be converted back into the chemical structure.

Add Canonical Code...: This method constructs a unique hash code from the chemical structure. During this process it may neglect stereo information, normalize tautomeric states, and/or remove small fragments. Because of this, such hash codes may later be used to locate stereo isomers, tautomers, different salts of the same structure, or combinations of these. Typically, one would afterwards remove redundant structures, or make lists of duplicate or distinct structures and use these lists for further analysis.

After selecting this functionality a dialog opens letting you choose the structure column, and whether to distinguish stereo isomers, whether to distinguish tautomers, and in case of multiple fragments whether to use the largest fragment only. If, for example, you would neither select to distinguish tautomers nor stereo isomers, then DataWarrior would take the structure, remove all stereo information, construct a normalized tautomer, make a canonical representation from it and finally derive a hash code from that. This way different stereo isomers and different tautomers would end up in the same structure and, therefore, produce the same hash code.

Add Substructure Count...: This function determines for every molecule in a table column how many times a given substructure can be found inside the molecule's structure. In addition to the substructure fragment itself, one may define whether the substructure search algorithm should only count separate matches or whether it should also count matches that include atoms of an earlier counted match. When counting benzene rings, for example, naphthalene yields one or two matches, depending on this setting.
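
The difference between the two counting modes can be illustrated with a minimal Python sketch. It assumes that a substructure search has already produced the unique matches, each given as a set of atom indices. The function name count_matches and the atom numbering of naphthalene are illustrative assumptions and do not reflect DataWarrior's internal implementation.

```python
def count_matches(matches, allow_overlap):
    """Count substructure matches given as sets of atom indices.

    If allow_overlap is False, a match is only counted when it shares
    no atom with a match that was counted earlier (greedy, in list order).
    """
    if allow_overlap:
        return len(matches)
    used_atoms = set()
    count = 0
    for match in matches:
        if used_atoms.isdisjoint(match):
            used_atoms |= match
            count += 1
    return count


# Naphthalene with atoms numbered 0-9 (hypothetical numbering):
# the two six-membered rings share the fusion atoms 4 and 9.
naphthalene_rings = [{0, 1, 2, 3, 4, 9}, {4, 5, 6, 7, 8, 9}]

print(count_matches(naphthalene_rings, allow_overlap=True))   # 2
print(count_matches(naphthalene_rings, allow_overlap=False))  # 1
```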

Extract Fragment...: If a molecular structure consists of multiple disconnected fragments, this function can be used to put a copy of the relevant fragment into a new column. Typically, this is used to remove counter ions from salts, to remove water molecules, or to remove otherwise not important small molecules from the largest and supposedly relevant molecule. Therefore, the default option is to extract the largest fragment, but one may also provide a substructure, which a fragment must contain in order to be recognized as the relevant fragment.
If the option Neutralize charges is selected, then DataWarrior tries to remove charges to neutralize the new fragment. If Convert to sub-structure is selected, then the extracted fragments are put as sub-structures instead of molecules into the new column. Substructures differ from molecules in two ways: First, atoms with open valences are not assumed to be filled with implicit hydrogen atoms. Second, a double click opens the structure editor in sub-structure mode, which lets you define atom and bond query features as well as exclude groups. Substructures can be used as criteria in a substructure search, e.g. to query a database. In DataWarrior, database queries may contain multiple substructures at once. Often, you can select multiple rows before opening a database query dialog (e.g. for building block search). After opening the search dialog, you may then select an option to search the database for all these selected (sub-)structures at once.

Dialog configured to count hydroxy groups in molecules.


The Structure Filter

If a DataWarrior file contains chemical structures, then the filter panel usually contains one or more Structure Filters. In any case, one can always add new Structure Filters by selecting New Filter... from the Edit menu.

A Structure Filter option menu lets you switch between two general operation modes:
  • contains hides compounds (not) having a certain sub-structure
  • is similar to [...] hides rows with (dis-)similar structures. The descriptor used is given in brackets.

    A double-click on the filter's structure field opens the Osiris Structure Editor. Please note that the behavior of the editor differs between the sub-structure search and the similarity search. In the former case you edit a potentially substituted chemical fragment that may contain query features, while in the latter case you define a complete molecule.

    Structure editor in molecule mode.

    To edit a structure one typically selects a tool from the buttons on the left and uses the mouse to apply the tool in the right structure area. The keyboard can be used to accelerate the drawing process, e.g. typing a digit adds a chain of n atoms or changes a bond order depending on whether the mouse pointer is on top of an atom or a bond. '+' or '-' change an atom charge and Del removes selected atoms. Typing one or more letters changes an atom type. More options to modify an existing atom are available in the atom dialog, which opens after a double click on an atom. A single click just applies the defined options to the clicked atom.

    Atom dialog to define isotopes, radical state, abnormal valences and custom atom labels.

    Most of the structure editor's buttons are rather self-explanatory, while some need a little more explanation. For a detailed description of the editor please refer to the Structure Editor section.


    Substructure Search

    A filter in substructure search mode hides all rows whose structures do not contain the drawn molecule fragment. Required atom or bond features of the query fragment can be defined to narrow the search. A double click on an atom or bond after selecting the lasso tool opens a corresponding atom or bond query feature dialog. If multiple atoms or bonds have been selected, then the dialog's settings are applied to all selected atoms or bonds when the dialog is closed.


    Atom query feature dialog of structure editor in fragment mode.

    The first option in the atom query feature dialog lets you convert a defined atom into a wild-card. In the following field, allowed atoms, one may specify more precisely which atom types are allowed at this position. If any atomic number is selected and the atom is therefore declared a wild-card, then the allowed atoms field turns into an excluded atoms field. This causes the structure search to accept any atom but the explicitly excluded ones. While these options make the query fragment less specific, all following options narrow the search. Most of these options are self-explanatory.

    The option is part of an exclude group is a powerful feature that goes well beyond a standard substructure search. This option lets you mark one or more parts of a query structure to represent unwanted structural features. For instance, you may use an exclude group to search for nitrogen atoms not having a carbonyl group attached, for a fluoro group that is not part of a di- or tri-fluoro substitution, for pyridine groups without methoxy in meta-position, or for ketones that don't have an aldehyde in the same molecule. Exclude groups are also very useful in the context of generic reactions to build combinatorial libraries.

    Exclude groups are drawn in dark red with a pink background.

    The image above shows four typical small sub-structure queries using exclude groups:

    • an imidazole ring not being substituted by a carbonyl group at its 1-position
    • a nitrogen atom connected to at least one hydrogen, not bearing a carbonyl group, and not being connected to a non-carbon atom
    • a molecule containing exactly one nitrogen atom
    • a molecule not containing aromatic carbon atoms
    If a query fragment contains any exclude groups, then the substructure search process runs in two steps. First it locates all matches of the query fragment not considering exclude groups. For every match found in the first step, a second step tries to extend the original match to include one after another of the exclude groups. If it succeeds with one of the exclude groups, then the original match is skipped.
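
    The two-step procedure can be sketched in Python for one concrete query: a nitrogen atom with an attached carbonyl group as exclude group. The molecule representation (a list of element symbols plus a dictionary of bond orders) and all function names are assumptions made for this illustration; DataWarrior's own substructure search is more general.

```python
def neighbors(bonds, atom):
    """Yield (neighbor, bond_order) for every bond of the given atom."""
    for (a, b), order in bonds.items():
        if a == atom:
            yield b, order
        elif b == atom:
            yield a, order


def nitrogen_without_carbonyl(elements, bonds):
    """Return nitrogen atoms that have no carbonyl carbon attached."""
    # Step 1: match the query fragment (a single N), ignoring the exclude group.
    candidates = [i for i, e in enumerate(elements) if e == "N"]

    hits = []
    for n in candidates:
        # Step 2: try to extend the match by the exclude group (C=O on the N).
        extended = any(
            elements[c] == "C"
            and any(elements[o] == "O" and order == 2
                    for o, order in neighbors(bonds, c))
            for c, _ in neighbors(bonds, n)
        )
        if not extended:  # exclude group could not be matched: keep the hit
            hits.append(n)
    return hits


# CH3-C(=O)-NH-CH2-CH2-NH2 with implicit hydrogens (atoms 0..6)
elements = ["C", "C", "O", "N", "C", "C", "N"]
bonds = {(0, 1): 1, (1, 2): 2, (1, 3): 1, (3, 4): 1, (4, 5): 1, (5, 6): 1}

print(nitrogen_without_carbonyl(elements, bonds))  # [6]
```

    The amide nitrogen (atom 3) is found in the first step but skipped in the second, because its match can be extended to include the exclude group.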

    In the Structure Editor atom or bond query features are usually reflected in the drawing in an obvious way or indicated with small letters (s:further substituted, a:aromatic, h:hydrogen count, c:charge, n:neighbour count, r:ring size, etc.). Some complex query features may not be shown directly. In any case atoms and bonds with query features show a yellow spot in their background to indicate their special meaning.

    Query fragment with yellow markers to indicate invisible query features.

    In the bond query features dialog one may select more than one bond type, e.g. Single and Delocalized as in the example below. This would match the query bond to both single and delocalized bonds. It is important to understand that DataWarrior considers 5-membered aromatic rings not to be delocalized. Therefore, the Delocalized option does not match any bond of an aromatic 5-membered ring, unless it is annelated to an aromatic 6-membered ring, which is considered delocalized and which causes the shared bond between the two rings to be delocalized as well. In general, DataWarrior considers aromatic bonds with no preferred mesomeric structure as being delocalized.

    Bond query feature dialog of structure editor in fragment mode.

    An Atom Bridge option allows to convert any bond of the query structure to represent a chain of multiple connected atoms rather than a direct connection. One defines the allowed number of atoms within the chain not counting the atoms already drawn as part of the query structure.

    If Highlight Structure By -> Recent Filter is selected in the column header menu of the structure column in the Table View, then the query fragments as part of the entire molecule are drawn in dark red, while the rest of the displayed structures is usually drawn in black.


    Structure Similarity Search

    To find structurally similar compounds rather than compounds sharing a specific sub-structure, select any of the is similar to [...] options. Then double-click the filter's structure area and draw a molecule using the Structure Editor or drag and drop a molecule from somewhere else to the filter. You may adjust the similarity level to your needs. Typically, chemists will perceive molecules as very similar if their similarity value is about 0.90 or above.

    If Highlight Structure By -> Recent Filter is selected in the column header menu of the structure column in the Table View, then structural elements that are shared between displayed structures and the query structure are highlighted in green.

    By default, DataWarrior calculates a FragFp descriptor of the first structure column within the data table. This descriptor can be used to calculate similarities between molecules. The FragFp similarity between two molecules is the number of fragments that both molecules have in common divided by the number of fragments found in either of the two molecules. If your dataset contains other descriptors, then your filter menu contains associated is similar to [...] options that you may choose to filter by another similarity criterion. Other descriptors can be calculated by choosing Chemistry->From Chemical Structure->Calculate Descriptor->... from the menu. Once the calculation has finished, the associated similarity option becomes available in the filter menu. The Similarity & Descriptors section provides more information on descriptors and similarities and on which kind of similarity should be used for which purpose.
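
    The ratio described above is a Tanimoto (Jaccard) coefficient. The following Python sketch computes it for two molecules, each represented by the set of fragment identifiers it contains. The function name and the fragment labels are hypothetical, and returning 0.0 when both sets are empty is an assumption of this sketch.

```python
def fragment_similarity(fragments_a, fragments_b):
    """Shared fragments divided by fragments present in either molecule."""
    union = fragments_a | fragments_b
    if not union:
        return 0.0  # assumption: two molecules without fragments score 0
    return len(fragments_a & fragments_b) / len(union)


# Hypothetical fragment sets of two molecules
mol_a = {"f1", "f2", "f3"}
mol_b = {"f2", "f3", "f4"}

print(fragment_similarity(mol_a, mol_b))  # 0.5 (2 shared / 4 in total)
```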


    Reaction Sub-Structure Search

    If an open DataWarrior window contains a column with chemical reactions, then typically the filter panel shows two filters offering reaction search, a reaction filter and a retron filter. More filters can be added with Edit->New Filter..., e.g. structure filters to search the reactants or products. Like the structure filter, a reaction filter (see below) has an option menu to choose between two modes, contains and is similar to [RxnFp]. If the contains mode is active, then one may draw, drag or paste in a reaction sub-structure. This is a generic reaction of which all reactants and products consist of incomplete sub-structures. Their atoms and bonds may also contain query features, i.e. additional conditions for atoms or bonds to be considered a match. Moreover, reactant atoms that exist on the reaction's product side should be properly mapped with the reaction editor's mapping tool.

    Once a reaction sub-structure is defined, DataWarrior performs a search in the reaction column to hide those rows whose reactions don't match the specified reaction. For that, DataWarrior runs a sub-structure search on the reactants and on the products. Furthermore, it checks whether sub-structure matches are consistent with the defined atom mapping.

    The reaction filter shown above contains a reaction sub-structure describing an intramolecular Suzuki reaction. The carbon atoms are decorated with a small 'a', which indicates that they carry a query feature to match aromatic atoms only. The bond on the product side, which is formed by the reaction, is marked with the label 'r!a', because two query features define it to be a non-aromatic ring bond. Effectively, this makes the reaction an intramolecular one. What cannot be seen in the above image is that both carbon atoms on the reactant side are mapped to the two product carbon atoms. The '[4]' indicates an atom list with four atoms. In this case these are Cl, Br, I, and O.

    Please note that the example reaction is intentionally incomplete: Both leaving groups, the boronic acid substituent as well as the halogen/oxygen atom, are not drawn on the product side and therefore don't carry atom mapping numbers. This is intentional, because in typical reaction collections these leaving groups are not part of the specified products either. If we had included them in the query, then the sub-structure search on the product would not match those reactions that are stored without leaving groups.


    Reaction Similarity Search

    If an open DataWarrior window contains a column with chemical reactions, then this reaction column will typically have an associated RxnFp descriptor in an invisible column. Typically, in this case there will also be a reaction filter among all filters on the right. If this is missing, you may add it with Edit->New Filter..., choose Reaction [Reaction], which means 'Reaction'-filter for column 'Reaction', and press OK. If the RxnFp descriptor doesn't exist, you may choose Chemistry->From Chemical Structure->Calculate Descriptor->RxnFp to generate it.

    If the filter mode is is similar to [RxnFp], then the two sliders are also active, which let you set similarity thresholds for both, the reaction center and the reaction periphery. The reaction center refers to those reactant and product atoms, which change direct connections to any neighbor atoms. These may be a broken or newly formed bond or just a change in the bond order. The reaction periphery designates all remaining atoms, those that don't belong to the reaction center.
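
    The distinction between reaction center and periphery can be sketched in Python, assuming that every atom is identified by its mapping number and that the bonds of both reaction sides are given as dictionaries from atom pairs to bond orders. This data layout and the function name are assumptions for illustration only and not DataWarrior's internal representation.

```python
def split_center_and_periphery(reactant_bonds, product_bonds):
    """Split mapped atoms into reaction center and periphery.

    Both arguments map frozenset({map_no_a, map_no_b}) to a bond order.
    An atom belongs to the center if any of its bonds is broken,
    newly formed, or changes its bond order.
    """
    center = set()
    for bond in reactant_bonds.keys() | product_bonds.keys():
        if reactant_bonds.get(bond) != product_bonds.get(bond):
            center |= bond
    all_atoms = set().union(*reactant_bonds, *product_bonds)
    return center, all_atoms - center


# CH3-CH2-Br + OH(-) -> CH3-CH2-OH + Br(-)
# Mapping numbers: 1 = CH3 carbon, 2 = CH2 carbon, 3 = Br, 4 = O
reactants = {frozenset({1, 2}): 1, frozenset({2, 3}): 1}
products = {frozenset({1, 2}): 1, frozenset({2, 4}): 1}

center, periphery = split_center_and_periphery(reactants, products)
print(sorted(center))     # [2, 3, 4]
print(sorted(periphery))  # [1]
```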

    For reaction similarity filtering not only the reaction column needs to contain a proper atom mapping. The query reaction used in the filter needs the same. If the query reaction was taken by copy/paste or drag&drop from a mapped reaction column, then it should be properly mapped. If it, however, was drawn by hand, then it should be carefully mapped with the editor's mapping tool.

    When drawing a reaction, one should usually take care that all reaction center atoms are present on the reactant and on the product side and that they are properly mapped. However, typical reaction databases are sometimes incomplete in this regard. Often leaving groups don't show up on the product list. Small reactants, especially if the solvent serves as reactant as well, are frequently not part of the explicit reactant list. Examples are methylations of hydroxy groups, where methyl-iodide often is missing, or deprotection reactions where the protective group disappears into nothing. For best search results it is advisable to adapt your query style to the dataset you are searching.


    Reaction Retron Search

    The retron filter offers a simple and yet surprisingly effective reaction search method. Retrons are substructure fragments that exist one or multiple times on the product side of the reaction, but don't exist or exist with a lower count among the reactants. Therefore, the retron substructure must have been built in the course of the reaction. Unlike other reaction search methods, a retron search does not even require searched reactions to be mapped. Since retrons are simple sub-structures, they may also contain atom and bond query features.

    The retron filter shown above searches a reaction collection for the synthesis of 1,3,4-thiadiazoles without any substituent in position 2 and without an attached nitrogen at position 5.

    The only requirement for a retron filter to work is that the FragFp descriptor has been generated for both reactants and products of the reaction column to be searched. Typically, these descriptors are calculated by default. However, if they are missing, they may be calculated by selecting e.g. Chemistry->From Chemical Reaction->Calculate Descriptor->Reactants->FragFp.

    In detail, the retron search algorithm runs a sub-structure search on the product side and determines how often the particular fragment can be found. If the fragment is found at least once, then the number of sub-structure matches on the reactant side is determined. If the fragment occurs more often on the product side, then the reaction is considered a match, because the assumption is that this fragment must have been built. Some reactions turn out to be false positive hits, e.g. if a significant side product is shown beside the main product. If the reaction is mapped, side products can be identified by their duplicate mapping numbers. Therefore, DataWarrior applies a mapping number uniqueness check to remove false positives if the reactions are mapped.
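
    The counting rule and the mapping number check can be written down as a small Python sketch. It assumes that the match counts have already been determined by a substructure search and that unmapped atoms carry the mapping number 0. Both function names are hypothetical, and how DataWarrior combines the two checks internally is not specified here.

```python
def is_retron_hit(product_count, reactant_count):
    """True if the fragment occurs on the product side and more often
    there than on the reactant side."""
    return product_count >= 1 and product_count > reactant_count


def has_duplicate_map_numbers(product_map_numbers):
    """True if any mapping number (other than 0 = unmapped) occurs more
    than once among the product atoms, which hints at a side product."""
    mapped = [n for n in product_map_numbers if n > 0]
    return len(mapped) != len(set(mapped))


print(is_retron_hit(product_count=1, reactant_count=0))  # True
print(is_retron_hit(product_count=2, reactant_count=2))  # False
print(has_duplicate_map_numbers([1, 2, 3, 2]))           # True
```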


    Find & Replace with Text and Chemical Structures

    Usually, Find & Replace functionality is performed on text data. For this purpose the Find & Replace dialog (left) lets you select a column, a text string to be searched for and another piece of text as a replacement string. You may also choose all empty cells or any cell content to be replaced by a new content. In addition you may define, whether the replacement shall take place in all rows, or whether it shall be restricted to visible or selected rows only. This way you may for instance choose to replace any content of all selected cells with some new content.

    When using this dialog to replace text, the option Find regex: allows you to specify a regular expression instead of a simple character sequence to be matched and replaced. Such an expression may also be used to define a zero-length sub-string that is characterized by its neighbour characters. This effectively inserts the replacement string rather than replacing existing characters. An example would be the expression "(?<=\D)(?=\d)|(?<=\d)(?=\D)", which locates any text position that separates a digit from a non-digit character. When giving this regex and a space character as the replacement string, the string "abc123def456" would be changed to "abc 123 def 456".
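
    The effect of this expression can be reproduced outside of DataWarrior, for instance with Python's re module, which supports the same lookbehind and lookahead syntax. The sketch below only demonstrates the regular expression itself; the helper name is an assumption and it does not call any DataWarrior functionality.

```python
import re

# Zero-length position between a non-digit and a digit, or vice versa
BOUNDARY = re.compile(r"(?<=\D)(?=\d)|(?<=\d)(?=\D)")


def insert_at_boundaries(text, replacement=" "):
    """Insert the replacement at every digit/non-digit boundary."""
    return BOUNDARY.sub(replacement, text)


print(insert_at_boundaries("abc123def456"))  # abc 123 def 456
```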

    Find & Replace dialogs for alphanumerical and structure columns.

    If the target column doesn't contain text but chemical structures, then the dialog changes and shows two substructure fields instead of text fields to define the search and replacement fragments (dialog on the right). For replacing one substructure by another, DataWarrior needs to know which attachment points on the search target match which atoms on the replacement fragment. This is defined by adding R-groups to both fragments. In order to replace a phenyl group with a cyclohexyl substituent, one needs to draw phenyl-R1 and cyclohexyl-R1. Atoms of the search fragment that don't carry an R-group are considered not to carry any more substituents. To define linkers or scaffolds instead of single bonded substituents, one needs to attach two or more R-groups to the search target and replacement fragment.

    If neither search nor replacement fragment carry any R-group, then the search fragment effectively defines an unconnected structure to be replaced. If in this case the replacement fragment is kept empty, then the search fragments are just removed from the target structures.

    Typical structure related use cases are:

    • Replace a given scaffold/linker/substituent by another one
    • Fill empty structure cells with a given structure
    • Remove certain unwanted fragments, e.g. HCl
    • Remove all selected structures


    Unifying 2D Atom Coordinates

    Typically, every column that contains chemical structures is accompanied by an invisible column with 2-dimensional atom coordinates, which allow DataWarrior to draw chemical structures in their original orientation. When no 2-dimensional atom coordinates exist, e.g. if an input file contains 3-dimensional coordinates only or if the structures were created from SMILES codes, then DataWarrior creates atom coordinates on the fly whenever a structure is displayed. If the original atom coordinates are not satisfactory or if molecules with shared scaffolds shall always be drawn with the same scaffold orientation, then one should (re-)generate atom coordinates for a given structure column. To create new atom coordinates, select Generate 2D Atom Coordinates... from the Chemistry menu. A dialog opens where one can choose the structure column to be used.

    Options for 2D atom coordinate calculation and unification.

    If no further options are selected and OK is pressed, then DataWarrior tries to generate atom coordinates for every structure individually. If, however, as in the example above, some scaffolds are defined, then DataWarrior checks for every structure of the selected column whether it contains any of the given scaffolds. If a scaffold is found, then the corresponding atoms' coordinates are copied from the scaffold and the remaining atoms' coordinates are optimized around the scaffold without touching the scaffold's orientation.

    If Automatically detect scaffolds... is selected, then DataWarrior still considers manually defined scaffolds, if there are any. In addition it processes all structures, which don't contain manually defined scaffolds. From these structures it compiles a unique list of unsubstituted scaffolds. Then it generates atom coordinates for every one of these scaffolds. Afterwards it processes all structures again. Every structure that contains a scaffold of the list will receive new atom coordinates that contain the scaffold's optimized coordinates. This ensures that compounds based on the same scaffold are always drawn in the same orientation.
    The algorithm used for automatic scaffold location can be selected from among these:

    • Most central ring system: Imagine one removes all atoms and bonds that are not part of any ring of a given molecule. In this case we retain all separated and unsubstituted ring systems of this molecule. The most central ring system is the one that is topologically closest to the center of the molecule.
    • Murcko scaffold: If we locate all ring systems of a molecule and add all atoms and bonds that directly connect different ring systems, and remove all other bonds and atoms, we retain the so-called Murcko scaffold. Basically it is the original molecule with all substituents removed that do not contain any ring.
    Both options require the existence of at least one ring in the molecule. In molecules that don't contain rings no scaffold is found. Hence, newly created coordinates for these molecules won't be influenced by the list of found scaffolds.
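
    The Murcko scaffold described above can be obtained by repeatedly removing terminal atoms until none are left. The following Python sketch does this on a plain atom connectivity graph. It ignores bond orders and any special treatment of exocyclic atoms, and the function names and the example molecule are assumptions made for illustration rather than DataWarrior's algorithm.

```python
def build_adjacency(bonds):
    """Turn a list of (atom_a, atom_b) pairs into a neighbor dictionary."""
    adjacency = {}
    for a, b in bonds:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    return adjacency


def murcko_scaffold_atoms(adjacency):
    """Return the atoms left after pruning all ring-free substituents."""
    remaining = {atom: set(nbrs) for atom, nbrs in adjacency.items()}
    terminal = [a for a, nbrs in remaining.items() if len(nbrs) <= 1]
    while terminal:
        atom = terminal.pop()
        if atom not in remaining:
            continue  # already removed earlier
        for nbr in remaining.pop(atom):
            remaining[nbr].discard(atom)
            if len(remaining[nbr]) <= 1:
                terminal.append(nbr)
    return set(remaining)


# Two three-membered rings (0-1-2 and 4-5-6) joined by linker atom 3,
# plus a methyl group (atom 7) on ring atom 5.
bonds = [(0, 1), (1, 2), (2, 0), (2, 3), (3, 4),
         (4, 5), (5, 6), (6, 4), (5, 7)]

print(sorted(murcko_scaffold_atoms(build_adjacency(bonds))))
# [0, 1, 2, 3, 4, 5, 6]
```

    For a molecule without any ring, the pruning removes every atom and the returned set is empty, which corresponds to the case in which no scaffold is found.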


    Comparing Structure Files

    Sometimes one needs to compare two rather big sets of compounds for overlaps, i.e. for compounds within one set, which have (or don't have) a similar or equal counterpart in the other set. Potential use cases may be:

    • locating commercially available compounds that are substantially dissimilar to any compound of an existing screening library. In a second step one might purchase a subset of those compounds that matches a desired physico-chemical property profile.
    • a virtual screening for those commercially available compounds that are similar to at least one of a known set of bioactive compounds. Both chemical and flexophore similarity may be useful in this case.
    • checking a commercial compound set against an in-house collection for equals considering tautomers and different salts as being equal.

    This task compares all compounds of the currently active DataWarrior window with every compound of another specified compound file. This compound comparison may either use a descriptor similarity or it may be an exact compound match, which may or may not include other stereo-isomers, tautomers and/or salts.

    The comparison results may be used in various ways. First, the open window will receive a new column containing the most similar structure of the other file and another new column containing the similarity value. Optionally, one may select more information from the external file to be included into new columns of the open window, e.g. a compound identifier. Second, one may also create new files containing all similar structures or all dissimilar structures from the external file. And third, one may write all compared compound pairs or just the similar ones into a new file.
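
    The first kind of result, the most similar external structure together with its similarity value, can be sketched in Python as follows. The sketch assumes that every compound is represented by a set of descriptor features and uses a simple shared/total ratio as similarity. All names, the example data, and the choice to include pairs exactly at the limit are assumptions of this illustration.

```python
def similarity(a, b):
    """Shared features divided by features present in either set."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar_counterparts(own, external, limit):
    """Map every own compound to (external_id, similarity) of its most
    similar external compound, or to None if none reaches the limit."""
    result = {}
    for own_id, own_features in own.items():
        best = None
        for ext_id, ext_features in external.items():
            sim = similarity(own_features, ext_features)
            if sim >= limit and (best is None or sim > best[1]):
                best = (ext_id, sim)
        result[own_id] = best
    return result


# Hypothetical feature sets of the open window and of the external file
own = {"A": {1, 2, 3, 4}, "B": {7, 8}}
external = {"X": {1, 2, 3, 5}, "Y": {1, 2, 3, 4}, "Z": {9}}

print(most_similar_counterparts(own, external, limit=0.8))
# {'A': ('Y', 1.0), 'B': None}
```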

    Options for locating (dis-)similar compounds between two files.

    File: Click Choose to select the external structure file, which the active window's structures shall be compared to. This file may also be the very file that was opened to show the active window.

    Structure column & comparison method: One may choose either a descriptor of a chemical structure column or the [Exact] method. In the first case a similarity slider will let you define a similarity limit, while with the latter option you may select, whether to neglect stereo-, tautomer-, and/or salt information, when doing an otherwise exact structure match. You may only choose among those descriptors, which already exist in your open window. If an external file does not contain the selected descriptor then it is calculated on the fly.

    Similarity limit: This defines the similarity threshold above which two compounds are considered similar. If a compound of the open DataWarrior window has similar counterpart(s) above this limit, then the most similar of these structures will be shown in a new column of the active window. If a compound pair file is written, then this will include pairs only, if their similarity is above this threshold. The similarity slider is visible only, if a descriptor is selected in the option above.

    Neglect stereo features, Consider tautomers equal, Consider largest fragment only: In case of an exact structure match, these options allow you to relax the compound comparison by considering any stereo-isomers, any tautomers, and/or any salts of otherwise equal compounds as still being equal. These options are only available if an [Exact] comparison is selected above.

    Select columns: This lets you select columns from the compared file, which shall be included as new columns into the current window and populated with the respective information of the most similar compound for each row.

    Save similar compounds to file: Select this to define a file name for storing all compounds of the external file that are found to be similar to any of the current window's compounds.

    Save dissimilar compounds to file: Select this to define a file name for storing all compounds of the external file that are not found to be similar to any of the current window's compounds.

    Save similar compound pairs to file: Select this to define a file name for storing all compound pairs that are more similar than the defined similarity limit.

    Compound-ID of this dataset: Here you may select a column from the active window containing compound identifiers, which will be written into the compound pair file.

    Compound-ID of external file: Here you may select a column from the external file containing compound identifiers, which will be written into the compound pair file.

    Create half matrix: This option is only available, if the active window's content was read from the same file that it is compared against. If this option is selected, then only half of the similarity matrix is processed, because this already covers all possible compound pair combinations.
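
    The comparison options above can be illustrated with a minimal sketch. It assumes that a binary descriptor is available as a set of "on" bit positions and that similarity is the Tanimoto coefficient; the function names, the example identifiers, and the inclusive treatment of the similarity limit are assumptions made for illustration and do not reflect DataWarrior's internal code.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Assumption: a binary descriptor is modelled as the set of its "on" bit positions.
Fingerprint = FrozenSet[int]


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Shared bits divided by the bits set in either fingerprint."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def most_similar(
    active: Dict[str, Fingerprint],
    external: Dict[str, Fingerprint],
    limit: float,
) -> Dict[str, Optional[Tuple[str, float]]]:
    """For each active compound return its best external match at or above the limit."""
    result: Dict[str, Optional[Tuple[str, float]]] = {}
    for cid, fp in active.items():
        best: Optional[Tuple[str, float]] = None
        for ext_id, ext_fp in external.items():
            sim = tanimoto(fp, ext_fp)
            if sim >= limit and (best is None or sim > best[1]):
                best = (ext_id, sim)
        result[cid] = best  # None means: no similar counterpart found
    return result


def similar_pairs(
    fingerprints: Dict[str, Fingerprint], limit: float, half_matrix: bool = True
) -> List[Tuple[str, str]]:
    """Compare a file against itself; the half matrix lists every pair only once."""
    ids = list(fingerprints)
    pairs = []
    for i, a in enumerate(ids):
        start = i + 1 if half_matrix else 0
        for b in ids[start:]:
            if a != b and tanimoto(fingerprints[a], fingerprints[b]) >= limit:
                pairs.append((a, b))
    return pairs


# Hypothetical example data
active = {"A1": frozenset({1, 2, 3, 4}), "A2": frozenset({10, 11})}
external = {"E1": frozenset({1, 2, 3}), "E2": frozenset({1, 2, 3, 4, 5}), "E3": frozenset({20})}
matches = most_similar(active, external, limit=0.7)

own_file = {"X": frozenset({1, 2, 3}), "Y": frozenset({1, 2, 3, 4}), "Z": frozenset({9})}
half = similar_pairs(own_file, limit=0.7, half_matrix=True)
full = similar_pairs(own_file, limit=0.7, half_matrix=False)
```

    In this sketch the half matrix yields each similar pair once, while the full matrix reports it in both directions, which is why half of the matrix already covers all pair combinations when a file is compared with itself.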


    Selecting Diverse Compounds from Large Set

    This function is an efficient implementation for locating a highly diverse subset within a given set of molecules. The algorithm can be preloaded with a second set of molecules, causing it to select molecules that are both most different from any molecule in the second set and highly diverse among themselves. For this reason in particular, this function is well suited for selecting diverse screening compounds from a provider's catalog while avoiding any compound that is similar to already available in-house compounds.

    Dialog configured to select 50000 diverse compounds different from currentLibrary.dwar.

    All binary descriptors can be used with this algorithm. After computing the desired number of diverse compounds, a column is added to the dataset with ascending numbers indicating the selected compounds. The compound with number 1 is the one that is most different from all the others. Compound number 2 is most different from number 1. Compound 3 is the one most different from 1 and 2, and so forth. If a dataset contains a few awkward compounds, then these are likely to be picked first. Therefore, in practice one would often skip the very first compounds of the diverse selection.
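
    The selection order described above resembles a greedy "max-min" picker. The following sketch is a generic version of that idea and not DataWarrior's actual (more efficient) implementation; the fingerprint model, the function names, and the rule used for the very first pick are assumptions.

```python
from typing import Dict, FrozenSet, Iterable, List

Fingerprint = FrozenSet[int]  # assumption: "on" bit positions of a binary descriptor


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def pick_diverse(
    candidates: Dict[str, Fingerprint],
    count: int,
    preloaded: Iterable[Fingerprint] = (),
) -> List[str]:
    """Return candidate ids in selection order (rank 1 first)."""
    reference = list(preloaded)
    remaining = list(candidates)
    # Highest similarity of each candidate to any preloaded or already selected compound.
    nearest = {
        cid: max((tanimoto(candidates[cid], fp) for fp in reference), default=0.0)
        for cid in remaining
    }
    selected: List[str] = []
    while remaining and len(selected) < count:
        if not selected and not reference:
            # Nothing to compare against yet: start with the compound that is,
            # in total, least similar to all other candidates.
            pick = min(
                remaining,
                key=lambda c: sum(
                    tanimoto(candidates[c], candidates[o]) for o in remaining if o != c
                ),
            )
        else:
            # Otherwise take the candidate whose closest chosen compound is least similar.
            pick = min(remaining, key=lambda c: nearest[c])
        selected.append(pick)
        remaining.remove(pick)
        for cid in remaining:
            nearest[cid] = max(nearest[cid], tanimoto(candidates[cid], candidates[pick]))
    return selected


# Hypothetical catalog of four compounds
catalog = {
    "a": frozenset({1, 2, 3}),
    "b": frozenset({1, 2, 4}),
    "c": frozenset({7, 8, 9}),
    "d": frozenset({1, 2, 3, 4}),
}
order = pick_diverse(catalog, count=4)
# Preloading an "in-house" compound identical to "c" pushes "c" to the end of the ranking.
order_preloaded = pick_diverse(catalog, count=2, preloaded=[frozenset({7, 8, 9})])
```

    The outlier "c" is picked first when nothing is preloaded, which mirrors the remark that awkward compounds tend to appear at the top of the selection.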


    Clustering Compounds

    Clustering is an old cheminformatics technique for subdividing a typically large compound collection into small groups of similar compounds. Clustering was used in the old days to precompute similarity relationships between compounds, when computational resources were expensive and the calculation of compound similarities time-consuming. Cluster membership could be easily stored in databases to be quickly retrieved later, whenever the need arose to locate similar structures to any given structure, e.g. after a high-throughput screening. The inherent problem of clustering is that cluster borders are arbitrary and may separate very similar compounds into different clusters. Therefore, a list of all cluster co-members of a given compound has some overlap with, but differs from the set of its most similar compounds.

    The cluster algorithm implemented in DataWarrior uses a hierarchical, bottom-up approach. This simple algorithm gives perfect and reproducible results, but is computationally demanding and is therefore best used if the dataset does not contain many more than 10000 compounds.

    The algorithm works as follows: First the complete similarity matrix triangle over all n compounds is calculated using a descriptor of choice. Then, the two most similar compounds are joined into the first cluster. The similarity matrix is updated by removing all similarity values related to these two compounds and replacing them by a new set of similarities between the cluster center and any other compound. The new values are calculated as the mean of the two original similarity values. Step by step the process continues by joining the most similar compound/cluster pair and merging the corresponding similarity values as a weighted mean based on the number of cluster members. The merging process continues until a stop criterion is met. Two possible stop criteria can be defined in the cluster dialog.

    The clustering process can be configured to stop when the cluster count has dropped to a predefined number or when the similarity needed to join two clusters falls below a definable limit. If both criteria are defined, then the clustering stops as soon as either of the two criteria is met.
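
    The merging procedure and its two stop criteria can be sketched as follows. This is a minimal illustration of the described bottom-up scheme with size-weighted mean similarities; the function signature and the toy similarity function are assumptions, and the quadratic bookkeeping is kept deliberately naive.

```python
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Optional, Sequence


def cluster(
    items: Sequence,
    similarity: Callable[[object, object], float],
    max_clusters: Optional[int] = None,
    min_similarity: Optional[float] = None,
) -> List[List[int]]:
    """Join the most similar pair until one of the stop criteria is met."""
    clusters: Dict[int, List[int]] = {i: [i] for i in range(len(items))}
    # Similarity matrix triangle, keyed by the unordered pair of cluster ids.
    sim: Dict[FrozenSet[int], float] = {
        frozenset((i, j)): similarity(items[i], items[j])
        for i, j in combinations(range(len(items)), 2)
    }
    next_id = len(items)
    while len(clusters) > 1:
        if max_clusters is not None and len(clusters) <= max_clusters:
            break  # stop criterion 1: cluster count reached
        pair, best = max(sim.items(), key=lambda kv: kv[1])
        if min_similarity is not None and best < min_similarity:
            break  # stop criterion 2: best join falls below the similarity limit
        a, b = tuple(pair)
        size_a, size_b = len(clusters[a]), len(clusters[b])
        merged = clusters.pop(a) + clusters.pop(b)
        del sim[pair]
        for other in clusters:
            # Replace the two old rows by one row holding the size-weighted mean.
            sa = sim.pop(frozenset((a, other)))
            sb = sim.pop(frozenset((b, other)))
            sim[frozenset((next_id, other))] = (size_a * sa + size_b * sb) / (size_a + size_b)
        clusters[next_id] = merged
        next_id += 1
    return [sorted(members) for members in clusters.values()]


# Toy data: numbers stand in for compounds, closeness stands in for similarity.
values = [0.0, 0.1, 0.15, 5.0, 5.2]


def closeness(x: float, y: float) -> float:
    return 1.0 / (1.0 + abs(x - y))


by_count = sorted(cluster(values, closeness, max_clusters=2))
by_limit = sorted(cluster(values, closeness, min_similarity=0.9))
```

    With the count criterion the five toy values end up in two groups; with the similarity limit of 0.9 only the single closest pair is joined before the process stops.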


    Calculating Molecular Properties

    DataWarrior may calculate or estimate various properties directly from the chemical structure. These include physico-chemical properties, lead- or drug-likeness related parameters, ligand efficiencies, various atom and ring counts, molecular shape, flexibility and complexity as well as indications for potential toxicity. After calculating properties, these are automatically added as new columns to the data table. If chemical structures contain small disconnected fragments such as water molecules or counter ions, then these are removed before the property calculation, i.e. properties are always calculated for the largest fragment only. Exceptions are the total molweight and the number of disconnected fragments, which both refer to the unstripped input structure.

    To calculate any molecular properties from chemical structures select Add Compound Properties... from the Chemistry menu. Select the properties of interest from one or more property sections and click OK.

    Properties related to the ligand efficiency are based on IC50 values and require the selection of a corresponding numerical column that contains IC50-values.

    Some properties match those available in the OSIRIS Property Explorer, which was made public in 2000 and is now downloadable from www.openmolecules.org. Some of these properties and the algorithms used are explained in more detail here.

    Please note that pKa values and some properties that are derived from pKa-values cannot be selected, because DataWarrior currently does not have access to an open-source method to reliably calculate pKa-values from chemical structures. However, if you have a license to calculate pKa values with software from ChemAxon, then you may download the ChemAxon pKa-plugin for DataWarrior capka.jar from the ChemAxon website. If you put the downloaded 'capka.jar' file into the DataWarrior installation folder, then DataWarrior will automatically recognize the plugin and allow you to select and calculate those formerly greyed-out properties.


    Enumerating Combinatorial Libraries

    DataWarrior can generate all structures of a virtual combinatorial library, given that a generic reaction is defined and that for every one of its generic reactants a list of real reactant structures is provided. Enumerated product structures may then be used for many purposes, such as predicting physico-chemical properties, running pharmacophore searches, or docking them as potential ligand structures into a pocket of a protein structure. Product structures with the most promising properties may then be selected for synthesis.
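
    Because every reactant of one list is combined with every reactant of the other lists, the library size is the product of the list sizes. The following sketch only enumerates the reactant combinations; building the actual product structures requires a reactor and is not attempted here. The reactant names are hypothetical placeholders.

```python
from itertools import product
from math import prod
from typing import List, Sequence, Tuple


def library_size(reactant_lists: Sequence[Sequence[str]]) -> int:
    """Number of products: one per combination of one reactant from each list."""
    return prod(len(reactants) for reactants in reactant_lists)


def reactant_combinations(reactant_lists: Sequence[Sequence[str]]) -> List[Tuple[str, ...]]:
    """All combinations that a reactor would have to turn into product structures."""
    return list(product(*reactant_lists))


# Hypothetical reactant lists for a two-component reaction
acids = ["acid-1", "acid-2", "acid-3"]
amines = ["amine-1", "amine-2"]

size = library_size([acids, amines])
combos = reactant_combinations([acids, amines])
```

    The multiplicative growth is worth keeping in mind when choosing the number of reactants: two lists of 1000 reactants each already produce a million products.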

    Generic reactions consist of chemical sub-structures that define compatible reactants and created product(s) of a reaction. The reactant side often contains atom and bond query features. In particular, exclude groups are frequently used to define chemical functionality more specifically in order to address chemical reactivity and compatibility issues. Generic reactions should also contain atom mapping information, which assigns product atoms to reactant atoms and this way serves as a transformation description of how to construct the product from the reactants in terms of which bonds are broken, formed, or changed. Ideally, reactant structures carry query features restrictive enough that, if used within a substructure search in a building block database, only those molecules would be retrieved that would react under typical reaction conditions.

    To create a virtual library, select Enumerate Combinatorial Library... from the Chemistry menu. To start you may select one of the predefined templates, which are taken from Hartenfeller's published¹ collection of robust organic reactions. Or you may copy/paste or drag&drop a reaction from somewhere else, or draw one yourself. You may also load a reaction from a reaction file in RXN format. Note that RXN-files created by DataWarrior are compatible with other applications, but include a DataWarrior specific reaction encoding, which ensures that all query features are exactly reproduced when the file is re-opened by DataWarrior. Some of DataWarrior's query features, such as exclude groups, are not supported by standard RXN-files.

    Generic reaction with selected mapping tool and mapping numbers visible.

    If you decide to draw a new generic reaction yourself, then you may follow these steps:

    • Draw all needed atoms and bonds of all reactants and product(s) to properly define the reaction. Typically, at least those atoms must be drawn, which change their bonding throughout the reaction.
    • Please make sure that all atoms, which exist on the reactant and the product side, get properly mapped using the mapping tool.
    • You may use query features and exclude groups on the generic reactant side to constrain reactant structures. This is done for two reasons: First, the reactant (sub-)structure is used to discard non-qualifying reactant molecules, when opening a reactant file. Second, more closely defined reactant structures help the reactor to identify the correct functional group within a molecule, if there are multiple similar groups.
    • You may save your generic reaction for later re-use. Reactions are saved as RXN-files, a de-facto standard format introduced by MDL decades ago, which ensures that other applications can read reactions exported by DataWarrior. Since some query features and especially exclude groups are not supported by RXN-files, DataWarrior includes its native format in the file, such that these features are not lost when DataWarrior re-opens its own RXN-files.
    Once the generic reaction is defined, click on Reactants to switch from the reaction to the reactant panel.

    Reactant structures were chosen for first of two generic reactants.

    In the reactant panel you need to provide real reactant structures, which are then used to construct your products. The easiest way is to let DataWarrior suggest matching structures from the Enamine building block database. For that click the Suggest... button. It opens a dialog that lets you specify (in addition to the already defined substructure) a maximum price, a minimum package size, the number of wanted reactants, and a strategy to choose reactants if more reactants match your conditions than wanted.

    Option dialog to suggest matching reactants from the Enamine catalog.

    If you have your own file of pre-selected matching reactants or just a file containing any available chemicals, then you may read all matching structures from that file with a right mouse click into the respective reactant area and choosing Add From File.... In this case an automatic sub-structure search ensures that only those molecules are used, which contain the defined generic reactant. In addition or alternatively, you may add or change reactant structures using drag&drop or copy/paste. A right mouse click within the structure area opens a popup menu that lets you add, remove and edit individual structures. Once all reactant structures are defined, you may start the enumeration by clicking OK.

    When creating the product structures, DataWarrior retains the atom coordinates of the generic product. Therefore all products are later shown in the expected orientation. After all product structures have been created, DataWarrior creates some default views. Now you may calculate physico-chemical properties for all virtual products, calculate Flexophore similarities, generate conformers, cluster the products or run some other kind of analysis.

    Note: Usually, query features are used on the reactant's side only. However, if you use query features on a reactant bond to allow multiple bond orders and if you draw the respective product bond without defining multiple bond orders, then the product is constructed with the explicitly drawn bond order, no matter whether any used reactant molecules have a single, double or triple bond at that position. In order to tell the reactor to retain the original reactant molecule's bond order, you need to assign to the product bond the same bond types that you have defined for the reactant bond. You may even define a bond order increase or decrease for bonds with multiple allowed bond types, e.g. by assigning single or double to a reactant bond and double or triple to the respective product bonds.

    1) Hartenfeller M, Eberle M, Meier P, Nieto-Oberhuber C, Altmann K-H, Schneider G, Jacoby E, Renner S; A Collection of Robust Organic Synthesis Reactions for In Silico Molecule Design.; J Chem Inf Model, 2011, 51, 3093-3098


    Generating Evolutionary Libraries

    Chemical space is huge. There are estimates that the number of distinct, stable molecular structures with a molecular weight in the drug-like range is about 10^60. It will never be possible to compute all these structures to search them for the one with the most promising property profile for a particular purpose. Nevertheless, fishing for yet unknown promising structures in this vast compound space may be successful if the approach is right. It may then lead to completely new ideas or starting points, which would not be possible with the traditional structurally constrained virtual screening of existing or virtual combinatorial libraries.

    Computational methods are called De Novo Drug Design if they aim at suggesting entirely new molecules, which are supposedly active on a chosen drug target. DataWarrior uses an evolutionary algorithm for this purpose, which somewhat mimics nature's evolution of plants and animals. The algorithm starts with a small initial set of molecules called the first generation. From any of these molecules it then creates multiple derived, new, but similar molecules by applying a small random structural modification. All derived structures together form the first offspring generation, which is much larger than its parent one. The kinds of modifications applied are single atom replacements, atom insertions, bond order changes, substituent migrations, ring aromatisations, stereo center inversions, etc. Whenever DataWarrior needs to create derived structures from a parent one, it first compiles a list of all possible modifications. Each modification is then evaluated with regard to how much it increases or decreases the structure's drug-likeness (or optionally natural-product-likeness). Modifications that increase the drug-likeness are assigned a higher probability than modifications that decrease it. Changes that would create high ring strain are removed from the list. This modulation of modification probabilities ensures that the algorithm stays in the chemical space of drug-like (or natural-product-like) compounds.

    In a next step the algorithm applies customizable fitness criteria to rank the new generation's molecules according to these criteria. The highest ranking molecules from this generation are selected to survive and form the parent molecules for the next generation. Typically, after about one or a few hundred molecule generations DataWarrior has arrived at structures that optimally match the defined fitness criteria.
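
    The generation loop described above can be sketched generically. In this toy version a "molecule" is just an integer, and the modification list, the probability weights, and the fitness function are hypothetical stand-ins for DataWarrior's structural mutations, drug-likeness weighting, and fitness criteria.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

M = TypeVar("M")


def evolve(
    first_generation: Sequence[M],
    list_modifications: Callable[[M], List[Tuple[M, float]]],
    fitness: Callable[[M], float],
    offspring_per_parent: int = 8,
    survivors: int = 4,
    generations: int = 50,
    rng: Optional[random.Random] = None,
) -> M:
    """Return the best candidate seen over all generations."""
    rng = rng or random.Random()
    parents = list(first_generation)
    best = max(parents, key=fitness)
    for _ in range(generations):
        offspring: List[M] = []
        for parent in parents:
            # All possible modifications, each with a probability weight.
            options = list_modifications(parent)
            if not options:
                continue
            children, weights = zip(*options)
            offspring.extend(rng.choices(children, weights=weights, k=offspring_per_parent))
        if not offspring:
            break
        # Rank the new generation; the top candidates become the next parents.
        offspring.sort(key=fitness, reverse=True)
        parents = offspring[:survivors]
        if fitness(parents[0]) > fitness(best):
            best = parents[0]
    return best


# Toy problem: the "molecule" is an atom count and the target profile is 25 atoms.
def toy_modifications(n: int) -> List[Tuple[int, float]]:
    options = [(n + 1, 2.0)]  # hypothetical weight: growth is favoured
    if n > 1:
        options.append((n - 1, 1.0))
    return options


def toy_fitness(n: int) -> float:
    return -abs(n - 25)


best_found = evolve([1, 2], toy_modifications, toy_fitness, rng=random.Random(42))
```

    The weighted sampling with random.choices corresponds to assigning higher probabilities to favourable modifications, while the sort-and-truncate step corresponds to the survival of the highest ranking molecules.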

    Usually, one defines multiple fitness criteria in the Evolutionary Library Dialog. Criteria may be weighted to make them more important or less important than others. Together all defined criteria form the desired compound property profile. Simple fitness criteria consist of an optimal numerical range for a computable compound property. Others require compounds to be similar or dissimilar to a definable set of compounds using any descriptor. Most complex and computationally expensive are 3-dimensional similarity based criteria, because they involve the creation of conformers for every candidate created by the evolution. At the same time they are the most appealing fitness options, at least if one seeks to find novel ideas to replace a known binding ligand with known 3D-coordinates.
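
    How weighted range criteria might be combined into one fitness value is sketched below. The scoring function (1.0 inside the range, falling linearly to 0.0 over a tolerance outside of it) and the weighted average are assumptions chosen for illustration; the manual does not specify how DataWarrior computes the combined score.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class RangeCriterion:
    prop: str                 # name of a computable property
    low: Optional[float]      # None means: no lower bound
    high: Optional[float]     # None means: no upper bound
    weight: float = 1.0
    tolerance: float = 1.0    # hypothetical: distance outside the range where the score hits 0

    def score(self, props: Dict[str, float]) -> float:
        value = props[self.prop]
        if self.low is not None and value < self.low:
            miss = self.low - value
        elif self.high is not None and value > self.high:
            miss = value - self.high
        else:
            return 1.0  # inside the optimal range
        return max(0.0, 1.0 - miss / self.tolerance)


def overall_fitness(props: Dict[str, float], criteria: List[RangeCriterion]) -> float:
    """Weighted mean of all criterion scores, between 0.0 and 1.0."""
    total_weight = sum(c.weight for c in criteria)
    return sum(c.weight * c.score(props) for c in criteria) / total_weight


profile = [
    RangeCriterion("cLogP", None, 4.0, weight=2.0),
    RangeCriterion("molweight", 300.0, 450.0, weight=1.0, tolerance=100.0),
]
inside = overall_fitness({"cLogP": 3.2, "molweight": 410.0}, profile)
outside = overall_fitness({"cLogP": 4.5, "molweight": 500.0}, profile)
```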

    Evolutionary library options and fitness criteria defined to match shape and pharmacophoric features of known Ligase inhibitor, while being structurally different from the same inhibitor, and having a cLogP value below 4.0.

    In the fitness example above we look for compounds whose chemical structure is dissimilar to a known Ligase inhibitor, but which nevertheless have at least one low energy conformer that matches the binding conformer of the Ligase inhibitor in shape and position of pharmacophoric features. The query structure was taken from the 4MUG entry of the PDB database. In this case the first generation consists of small drug-like random molecules and the type of molecules to create is also defined to be drug-like. After defining all criteria and pressing OK, the evolutionary algorithm starts designing molecules. A window opens that shows the progress, the structure of the currently mutated parent molecule, the molecules in the current generation and the overall best ranking molecule. The frame colors of these molecules range from red to green and indicate how well the fitness portfolio is already met.

    Window showing progress and best matching molecules during evolutionary library optimization.

    If so defined, the evolution stops automatically once no progress is achieved over some generations, or one may click Stop to halt the process manually at any time. Then, DataWarrior creates a new window with the best molecules from all generations and a view that shows how fitness increased over time. This view also highlights the best molecule and its parent molecules over all previous generations back to the first random ancestor.

    Window showing progress and best matching molecules during evolutionary library optimization.


    Generating Random Molecules

    There have been a few recent publications describing Long Short-Term Memory (LSTM) networks to generate SMILES codes representing new random molecules. Typically, these networks are trained with large text files containing thousands of SMILES encoded drug-like molecular structures. LSTM networks, when trained with a sequence of characters, learn the probability of specific characters to follow after a sequence of previous characters. Thus, a network trained with thousands of SMILES codes can make reasonable suggestions for the next character when presented with the starting part of a SMILES code. This way it may be used to construct new SMILES codes from scratch, which are composed of patterns that the network has seen during the training phase. Since LSTM networks cannot understand the grammar of SMILES, they produce a certain percentage of invalid SMILES codes, which must be filtered out in a second step. The major causes for invalid SMILES are atom labels consisting of two characters, aromaticity encoding by non-capital letters, the encoding of ring closures with two matching digits, and the need to match every open parenthesis with a closing one for substituents. Suggestions have been made to overcome some of these problems by changing the SMILES grammar for this purpose. However, a principal problem remains: LSTM networks are not optimal for polycyclic structures, because the underlying data structure is just a sequence of characters.
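
    Two of the grammar constraints mentioned above, matching ring closure digits and balanced parentheses, can be checked with a few lines of code. The sketch below is only a partial, purely syntactic filter with a hypothetical function name; it does not validate element symbols, aromaticity, or valences, for which a full SMILES parser of a cheminformatics toolkit would be needed.

```python
from typing import List, Set

DIGITS = "0123456789"


def smiles_syntax_problems(smiles: str) -> List[str]:
    """Report unbalanced parentheses and unmatched ring closure numbers."""
    problems: List[str] = []
    depth = 0
    open_rings: Set[int] = set()
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            # Digits inside bracket atoms are isotopes, H counts or charges: skip them.
            end = smiles.find("]", i)
            if end == -1:
                problems.append("unclosed bracket atom")
                break
            i = end + 1
            continue
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                problems.append("closing parenthesis without opening one")
                depth = 0
        elif ch == "%":
            label = smiles[i + 1:i + 3]
            if len(label) == 2 and all(c in DIGITS for c in label):
                open_rings ^= {int(label)}  # first occurrence opens, second closes
                i += 3
                continue
            problems.append("malformed %nn ring closure")
        elif ch in DIGITS:
            open_rings ^= {int(ch)}
        i += 1
    if depth > 0:
        problems.append("unclosed parenthesis")
    if open_rings:
        problems.append("unmatched ring closure: " + ", ".join(map(str, sorted(open_rings))))
    return problems


ok_benzene = smiles_syntax_problems("c1ccccc1")
ok_acid = smiles_syntax_problems("CC(=O)O")
ok_isotope = smiles_syntax_problems("[13CH4]")
broken = smiles_syntax_problems("C1CC(C")
stray = smiles_syntax_problems("CC)C")
```

    A generator that edits molecular graphs directly, as described in the next paragraph, never has to deal with this class of errors because it does not pass through a character sequence.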

    DataWarrior uses a different principle for the generation of random structures: Starting from a methane molecule it performs many random mutations on the molecule until its size reaches a desired molecular weight range. These mutations include atom additions, atom insertions, ring closures, bond order changes, atom changes, atom removal, and similar small modifications of the molecular structure. The algorithm favors mutations, which cause drug-like (or natural-product-like) sub-structures. It also prefers changes that let the molecule grow until its non-hydrogen atom count reaches a desired range.

    Optionally one may define a structure to start with. In addition, one may select a part of the starting structure, which would then be protected against mutations. This selected substructure would then be retained during molecule growth and, thus, would be part of every generated random molecule. To open the dialog for defining the random molecule generation process, select Generate Random Molecules... from the Chemistry menu.

    Dialog configured to generate random 3-amino-4-thio-azetidinones.

    Create compounds like: Currently, this option lets you choose among the generation of drug-like or natural-product-like structures.

    Nitrogen and oxygen bias: These sliders let you increase or decrease the likelihood of introducing nitrogen or oxygen atoms, respectively.

    Seed compound: This defines the starting molecule for the algorithm. If you intend to generate molecules with a certain substructure, then you need to draw a start molecule that contains this substructure and at least one more atom. Then use the lasso tool to select the substructure, which causes the algorithm to protect selected atoms and bonds from being mutated. All generated molecules will then contain the selected substructure.

    Molecule count: The number of molecules being generated.

    Minimum and maximum non-H atom count: These fields will influence the size of the generated molecules by defining a desired range of non-hydrogen atom counts.

    Molecule size distribution: Here one may select, whether generated molecule sizes will be evenly distributed within the defined range, or whether the majority of generated molecules will have non-H counts being around the middle of the defined range.

    Generated molecules with 3-amino-4-thio-azetidinone substructures.


    Analysing Scaffolds

    The Scaffold Analysis locates the core structure(s) of every molecule within a given column and creates a new column that contains these scaffolds. The method used to locate the core structure(s) depends on the chosen Scaffold type:

    • Plain ring systems: This mode locates all single ring and annelated ring systems without any substituents.
    • Ring systems with substitution pattern: This mode works as the previous one, but marks as substituted every ring atom that carries an exocyclic, non-hydrogen substituent in the original molecule.
    • Ring systems with carbon/hetero subst. pattern: This mode goes a step further by distinguishing, whether a substituent's first atom was a carbon atom or a hetero atom.
    • Ring system with atomic-no subst. pattern: This mode is even more specific. Every exocyclic substituent is represented by its first atom.
    • Murcko scaffold: The Murcko scaffold contains all plain ring systems of the given molecule plus all direct connections between them. Substituents that don't contain ring systems are removed from rings and ring connecting chains.
    • Murcko skeleton: The Murcko skeleton is a generalized Murcko scaffold, which has all hetero atoms replaced by carbon atoms.
    • Most central ring system: As the name implies, this is that ring system of the molecule, which is closest to its topological center. It does not contain any exocyclic substitution information.

    The image below illustrates the different scaffold modes. The original molecule structure is shown at the top middle position. The scaffold structures produced by each of the seven scaffold modes are shown around the original molecule.

    If the Save scaffold frequency file option is selected, then DataWarrior creates a new document listing all detected scaffolds and their occurrence frequencies. The name and location of the scaffold file can be set after pressing the Choose button.
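
    Conceptually, a scaffold frequency list is a count of identical canonical scaffold codes. The sketch below assumes that the scaffolds are already available as canonical strings, one per row, with None for rows without a scaffold. SMILES-like strings are used here only as readable stand-ins; counting works reliably only with a canonical encoding such as DataWarrior's ID-Codes.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def scaffold_frequencies(scaffolds: Sequence[Optional[str]]) -> List[Tuple[str, int]]:
    """Count identical scaffold codes, most frequent first; empty cells are skipped."""
    counts = Counter(code for code in scaffolds if code)
    return counts.most_common()


# Hypothetical scaffold column; None marks a row without any ring system.
scaffold_column = ["c1ccccc1", "c1ccncc1", "c1ccccc1", None, "c1ccccc1"]
frequencies = scaffold_frequencies(scaffold_column)
```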


    Decomposing R-Groups - Creating SAR-Tables

    Relating chemical structures to their biological effects is a fundamental and frequent task in drug discovery projects. Typically, the influence of individual scaffolds and in particular of their substituents on the target binding affinity is investigated. For this analysis many compounds with similar scaffolds, but changing substituents are synthesized and their biological activity is measured. Then a data table is generated that contains the chemical structures of the scaffold, sometimes linker(s), and substituents in separate columns. This way the influence of different substituents at one position, also called R-group, can easily be correlated with the corresponding biological effects. The hope is that the influence of different R-groups at one scaffold position is more or less independent of R-groups at other positions and of minor changes of the scaffold itself. Of course, this implies that the general binding mode of the compound to the target molecule stays unchanged.

    DataWarrior can perform a so-called R-group deconvolution or R-group decomposition to generate a Structure Activity Relationship (SAR) Table directly from a table of chemical structures. This can be done either automatically or more flexibly with some user guidance.

    To create a SAR-Table from your dataset, select Decompose R-Groups... from the Chemistry menu. A dialog lets you choose the structure column to be analysed. It also asks you to define the mechanism that is applied for every molecule from the structure column to determine the scaffold or core structure:

    • Most central ring system: As the name implies, with this option that ring system of the molecule, which is closest to the topological center of the molecule, is considered that molecule's scaffold. Compounds without rings are neglected in this mode and their cells remain empty in the new columns.
    • Murcko scaffold: The Murcko scaffold of a molecule is determined by locating all ring systems of the molecule and all atom chains that directly connect these ring systems. Everything else is considered a substituent. As with the previous mode, compounds without rings have no Murcko scaffold and are not processed.
    • Custom sub-structure(s): For more flexibility in defining which sub-structures should be considered a molecule's scaffold, use this mode. It requires a little more work, because you need to define substructure fragments, which, when found in a molecule, exactly define those atoms that constitute the scaffold in the following analysis. Of course, these substructures may include atom or bond query features. For instance, by using atom wildcards or variable bond bridges, a drawn sub-structure may detect multiple similar scaffold structures at once. This has the advantage that R-group positioning and numbering will be consistent for all scaffolds found by the same substructure query. If your data set contains multiple substantially different scaffolds, then you may define multiple substructures to cover all scaffolds in one SAR-analysis. Alternatively, you may run the Decompose R-Groups... task multiple times until all distinct scaffolds have been processed.

    "Decompose R-Groups" dialog with custom scaffold using atom list and bridge bond.

    Whatever mode you use for generating a SAR-Table, in a first step the task will analyse the entire dataset to identify all unsubstituted, raw scaffold structures that are present in the dataset. These may either consist of simple or more complex ring systems or may be basically any structure as the result of a substructure match. Once a list of all distinct scaffold structures that exist in the dataset has been compiled, the entire dataset is processed again to determine for every one of these raw scaffold structures which positions carry substituents in at least some of the matching molecules. At all exit vectors for which different substituents were found, a numbered R-group is attached to the formerly unsubstituted raw scaffold. In addition, if a scaffold atom always carries the same substituent, e.g. a methyl-group, then this substituent is not considered an R-group and is directly attached to the raw scaffold. After all R-groups and constant substituents have been attached to all scaffolds found, a new column is added and filled with the correct scaffold for every row. Then, for every varying scaffold position a numbered R-group column is added and filled with the R-group structure that the row's molecule carries at the corresponding exit vector position. Where R-groups are attached to a stereo center or to an exocyclic double bond, stereo topicities are determined and correctly represented in the scaffold structures used.
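
    The distinction between varying positions (numbered R-groups) and constant substituents can be sketched as follows. The sketch assumes that scaffold matching has already been done and that every row of one raw scaffold is given as a mapping from scaffold position to substituent label. The labels, the "H" placeholder for unsubstituted positions, and the function name are hypothetical, and stereo handling is ignored.

```python
from typing import Dict, List, Tuple


def find_r_groups(
    rows: List[Dict[int, str]],
) -> Tuple[Dict[int, int], Dict[int, str], List[Dict[str, str]]]:
    """Split substituted positions of one raw scaffold into R-groups and constants.

    Each row lists only the non-hydrogen substituents of one molecule.
    Returns (position -> R-group number, position -> constant substituent, SAR rows).
    """
    positions = sorted({pos for row in rows for pos in row})
    r_numbers: Dict[int, int] = {}
    constant: Dict[int, str] = {}
    for pos in positions:
        seen = {row.get(pos, "H") for row in rows}
        if len(seen) > 1:
            # Different substituents found at this exit vector: numbered R-group.
            r_numbers[pos] = len(r_numbers) + 1
        else:
            # Always the same substituent: it becomes part of the decorated scaffold.
            constant[pos] = seen.pop()
    table = [
        {f"R{number}": row.get(pos, "H") for pos, number in r_numbers.items()}
        for row in rows
    ]
    return r_numbers, constant, table


# Three hypothetical molecules sharing one raw scaffold
molecules = [
    {2: "CH3", 4: "Cl"},
    {2: "CH3", 4: "OCH3"},
    {2: "CH3", 4: "Cl", 5: "F"},
]
r_numbers, constant, sar_table = find_r_groups(molecules)
```

    Position 2 always carries a methyl group and is therefore attached to the scaffold, whereas positions 4 and 5 vary and become R1 and R2.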

    The picture below shows a data table with the original Structure column on the left and the new Scaffold and R-Group columns generated by an R-Group Decomposition using the custom substructure shown above. Worth mentioning are:

    • The query structure contains two wild card elements, a bridge bond with 0...2 atoms and an atom list with nitrogen and oxygen. Because of these, that single query structure matches onto three different scaffold structures, piperidine (rows 1,2), tetrahydrofuran (row 3), and tetrahydropyran (rows 4,5,6). Every one of these scaffold structures is analyzed independently concerning variation of substituents in all the matching rows. At exit vectors with differing substituents, an R-group is attached. The same decorated scaffold structure is then shown in all matching rows, e.g. the decorated tetrahydropyran in rows 4,5, and 6.
    • All decorated scaffold structures that result from the same query, use the same R-group numbering, even if the ring sizes are not the same, see scaffolds in rows 1 to 6.
    • When exit vectors at scaffold substructures are checked for substituents, in case of potentially created stereo centers (or double bond stereo configurations) stereo topicities are tracked and correctly handled, e.g. substituents at the R2 and R3 positions are correctly distinguished.
    • Detached substituent structures in the R-group columns carry an attachment point where they have been attached to the original molecule. Attachment points have an atomic number of 0 and are shown with a '?' as atom label.
    • If a substituent at an R-group position connects back to the original molecule, i.e. if we have a ring closure and the substituent is actually a chain connecting two R-group positions, then the back-connection is indicated by another pseudo atom with atomic number 0, which instead of an atom label shows the R-group number of the back-connection, e.g. R2 and R3 in row 6.

    Table view with new columns added after SAR-Analysis

    Note: After a SAR-analysis, when chemical structures are split into multiple columns containing scaffold and R-group structures, it may be useful to re-unite, i.e. merge, some R-group columns with the scaffold structure. This reduces dimensions and may be useful to better investigate, visualize, or focus on the influence of the remaining substituents. Scaffold and R-group columns can be merged in a chemically correct way with the Merging Columns functionality.


    Similarity Analysis

    In the recent literature (1-3) the terms Molecular Similarity Analysis, Activity Cliff Analysis or Activity Landscape are hot topics. All these related methods have in common that they usually start with a 2-dimensional scaling process of the chemical space, which means that all involved molecules are positioned on a 2D-area such that similar molecules are located close to each other. This scaling could be done by running a principal component analysis (PCA) on a descriptor of the molecules and using the first two components as coordinates. Another approach would be a self organizing map (SOM) from a descriptor. Both of these options are limited in terms of the descriptor type, because they require the input data to be a vector, i.e. a binary or numerical array of data.

    While DataWarrior allows running PCAs or SOMs on descriptor vectors and visualizing the results as a chemical landscape, the Similarity Analysis is based on a different method. It uses a Rubberbanding Forcefield approach (4), which translates similarity better than a PCA, is faster than a SOM, uses the available space more efficiently and works with any type of similarity criterion including the Flexophore descriptor.

    The approach involves the following steps:

    • randomly position all molecules on the 2D space
    • calculate the entire similarity matrix between all molecules
    • locate most similar neighbors to be considered for every molecule
    • between any two neighbors assume attractive forces, which increase with similarity and distance
    • stepwise relocate all molecules parallel to the mean vector of perceived forces
    • while attractive forces decrease over time and due to lower distances, introduce increasing short range repelling forces among all molecules
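
    The steps above can be illustrated with a small, self-contained Python sketch. It is not DataWarrior's implementation: the function name rubberband_layout, the force formulas, and all constants (force strengths, cutoff distance, number of steps) are assumptions chosen only to make the idea concrete. The similarity matrix is passed in as a plain list of lists, and optional start positions can replace the random initial placement to make a run reproducible.

```python
import math
import random


def rubberband_layout(similarity, neighbors=2, steps=200, start=None, seed=0):
    """Toy 2D layout: attract similar neighbors, repel all molecules at short range.

    similarity: full symmetric matrix (list of lists), values in [0, 1].
    All constants are illustrative assumptions, not DataWarrior's settings.
    Returns the final positions and the sorted list of neighbor pairs.
    """
    n = len(similarity)
    rng = random.Random(seed)

    # Step 1: random start positions (or caller-supplied ones).
    if start:
        pos = [list(p) for p in start]
    else:
        pos = [[rng.random(), rng.random()] for _ in range(n)]

    # Steps 2-3: the similarity matrix is the input; keep each molecule's
    # most similar neighbors as undirected pairs.
    pairs = set()
    for i in range(n):
        ranked = sorted((j for j in range(n) if j != i),
                        key=lambda j: similarity[i][j], reverse=True)
        for j in ranked[:neighbors]:
            pairs.add((min(i, j), max(i, j)))

    cutoff = 0.1  # range of the repelling force (assumed value)
    for step in range(steps):
        progress = step / steps
        attract = 0.1 * (1.0 - progress)  # attraction fades over time
        repel = 0.02 * progress           # repulsion grows over time
        total = [[0.0, 0.0] for _ in range(n)]
        count = [0] * n

        def push(i, fx, fy):
            total[i][0] += fx
            total[i][1] += fy
            count[i] += 1

        # Step 4: attraction between neighbors grows with similarity and distance.
        for i, j in pairs:
            dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
            f = attract * similarity[i][j]
            push(i, f * dx, f * dy)
            push(j, -f * dx, -f * dy)

        # Step 6: short-range repulsion among all molecules.
        for i in range(n):
            for j in range(i + 1, n):
                dx, dy = pos[j][0] - pos[i][0], pos[j][1] - pos[i][1]
                d = math.hypot(dx, dy)
                if d >= cutoff:
                    continue
                ux, uy = (dx / d, dy / d) if d > 1e-12 else (1.0, 0.0)
                f = repel * (1.0 - d / cutoff)
                push(i, -f * ux, -f * uy)
                push(j, f * ux, f * uy)

        # Step 5: move every molecule along the mean of the forces acting on it.
        for i in range(n):
            if count[i]:
                pos[i][0] += total[i][0] / count[i]
                pos[i][1] += total[i][1] / count[i]

    return pos, sorted(pairs)
```

    Called with a matrix in which two groups of molecules are highly similar within each group and dissimilar across groups, the sketch pulls each group into a compact cluster, while the late repulsion keeps the members of a cluster from collapsing onto a single point.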

    Three default views after similarity analysis

    When DataWarrior has finished the calculation of molecule positions, it creates three new default views:

    • A view depicting the chemical space of all molecules. Similar neighbors are connected with a connecting line and the markers that represent the molecules are colored dynamically by molecule similarity to the chosen Reference Molecule, which changes whenever you click another marker.
    • A tree view that shows the direct neighbors of the chosen Reference Molecule. When a marker or molecule is clicked on in any view, the Reference Molecule changes and the tree view's content is dynamically updated to show the neighborhood of the new molecule.
    • A structure view, which is configured to show selected molecules on top, while the non-selected ones are grayed out. The highlight mode of the respective structures column is set to Reference Row Similarity, causing any displayed molecule to show any structural differences to the molecule of the Reference Row. Structural elements possessed by the reference molecule, which are not part of the depicted molecule, are shown in red. Structural elements of the shown molecule, which are not present in the reference molecule, are highlighted with a blue background. To change the selection of displayed molecules, you simply need to select different markers in the tree view or on the similarity map.

    Since a Similarity Analysis is closely related to an Activity Cliff Analysis, more information about how to configure and run a similarity analysis can be found at the end of the next section.

    1) Peltason L, Bajorath J; Molecular similarity analysis uncovers heterogeneous structure-activity relationships and variable activity landscapes; Chem. Biol., 2007, 14 (5), pp 489-497
    2) Guha R, Van Drie J H; Structure-Activity Landscape Index: Identifying and Quantifying Activity Cliffs; J. Chem. Inf. Model., 2008, 48 (3), pp 646-658; DOI: 10.1021/ci7004093
    3) Bajorath J, Peltason L, Wawer M, Guha R, Lajiness M S, Van Drie J H; Navigating structure-activity landscapes; Drug Discovery Today, 2009, 14 (13-14), pp 698-705
    4) Sander T, Freyss J, Korff M v, Rufener C; DataWarrior: An Open-Source Program For Chemistry Aware Data Visualization And Analysis; J. Chem. Inf. Model., 2015, 55 (2), pp 460-473; DOI: 10.1021/ci500588j


    Activity Cliff Analysis

    The Activity Cliff Analysis uses the mechanism explained in the previous section to create a similarity map of all involved molecules. It also detects all similarity relationships between them above an automatically determined similarity threshold. To be precise, this is not a global cutoff value, but is modulated from molecule to molecule. Depending on the neighborhood situation of an individual molecule, the threshold may be increased or decreased to account for many very similar neighbors or for only a few, less similar ones. This reduces singletons and untangles large clusters to some extent.
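
    How exactly the threshold is modulated is not described here. The following Python sketch shows one hypothetical way such a per-molecule threshold could behave. The function local_threshold and all of its parameters (base, min_neighbors, max_neighbors, floor) are illustrative assumptions and do not reflect DataWarrior's actual rule.

```python
def local_threshold(sims, base=0.8, min_neighbors=1, max_neighbors=3, floor=0.6):
    """Hypothetical per-molecule similarity threshold.

    sims: similarities of one molecule to all other molecules.
    Crowded neighborhoods raise the threshold, sparse ones lower it,
    but never below `floor`. All parameter values are assumptions.
    """
    ranked = sorted(sims, reverse=True)
    above = sum(1 for s in ranked if s >= base)
    if above > max_neighbors:
        # Many very similar neighbors: keep only the most similar ones.
        return ranked[max_neighbors - 1]
    if above < min_neighbors and len(ranked) >= min_neighbors:
        # Few similar neighbors: admit the nearest ones, within reason.
        return max(floor, ranked[min_neighbors - 1])
    return base
```

    With these assumed defaults, a molecule with five neighbors above 0.8 gets a higher threshold that keeps only its three closest neighbors, whereas a molecule whose best neighbor has a similarity of 0.7 gets a threshold of 0.7 and is therefore no longer a singleton.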

    In addition to the Similarity Analysis, the so-called Structure-Activity Landscape Index (SALI) is calculated for all pairs of similar molecules. If two molecules have measured activities a1 and a2 and a structural similarity s, then the SALI value between these molecules is defined as SALI = |a1-a2| / (1-s). The SALI value is a measure of how much activity is gained (or lost) with a relatively small change in structure. Molecule pairs that show an abrupt change in activity despite having a rather similar structure are called activity cliffs. These pairs are particularly interesting if one tries to understand structure-activity relationships in order to design new structural motifs with improved activities.
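
    The SALI definition above translates directly into a few lines of Python. This is a minimal sketch of the formula only. The function name sali is illustrative, and rejecting s = 1 (where the formula is undefined) is an assumption about how to handle that edge case, not a statement about DataWarrior's behavior.

```python
def sali(a1, a2, s):
    """SALI = |a1 - a2| / (1 - s) for activities a1, a2 and similarity s."""
    if not 0.0 <= s < 1.0:
        # Assumption: identical structures (s = 1) are rejected, because
        # the formula would divide by zero.
        raise ValueError("similarity must satisfy 0 <= s < 1")
    return abs(a1 - a2) / (1.0 - s)
```

    For example, sali(6.0, 8.0, 0.5) gives 4.0, while the same activity difference at a similarity of 0.95 gives a value of roughly 40. The more similar two molecules are, the more a given activity difference counts.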

    After an Activity Cliff Analysis the generated similarity view encodes SALI values and activities in marker size and marker color, respectively. The image above shows a part of such a similarity map. In this case the dataset contained EC50 values on Cannabinoid CB1 and CB2 receptors. The marker background color reflects the receptor subtypes (CB1: pink, CB2: orange). One can easily recognize clusters of similar compounds, locate active compounds (red markers), locate activity cliffs (large markers), and even distinguish CB1 from CB2 inhibitors.


    Configuring And Running A Similarity Or Activity Cliff Analysis

    To perform a Similarity or Activity Cliff Analysis choose Analyse Similarity/Activity Cliffs... from the Chemistry menu. The following dialog appears:

    Similarity Analysis Dialog

    Similarity on: Defines the similarity criterion, i.e. the descriptor that is used for arranging molecules on the 2D-map. One may use any descriptor that DataWarrior knows of, provided that it has been calculated previously for the current data file. The most useful descriptors are SkelSpheres for fine-grained chemical graph similarity, OrgFunctions for similarity on synthetically relevant organic functionality, and Flexophore to create a molecule map based on the similarity of protein binding characteristics.

    Activity column: For a Similarity Analysis don't select a column here. For an Activity Cliff Analysis you need to select the column that contains the numerical values from which SALI values are calculated. For any pair of molecules the SALI value reflects how much activity is gained with a small change of the chemical structure. Very high SALI values identify activity cliffs, i.e. those rare points in an activity landscape where a small change of the chemical structure causes a large change in activity (or in any other experimentally determined molecule property). Identifying these molecule pairs and understanding the structural cause of the activity change can be very helpful in the process of designing compounds with better properties.

    Identifier column: The Similarity or Activity Cliff Analysis detects for every molecule its most similar neighbor molecules and writes a reference to those molecules into a new column. Therefore it needs a column that contains a key that uniquely identifies a molecule or data row. If your data contains compound identifiers, you may select that column. Otherwise, DataWarrior will create a new number for that purpose.

    Separate groups by: In some cases one column contains experimental data referring to multiple targets or measured under different conditions. If a second column contains categories describing the conditions, and if only values within the same category can be compared to each other, then you should select the category column here. SALI values will then only be calculated from compatible experimental values.
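
    The effect of this option can be pictured with a short Python sketch that computes SALI values only for sufficiently similar pairs whose rows belong to the same category. The data layout (dictionaries with the keys id, group and activity, and a similarity lookup keyed by pairs of identifiers), the function name sali_pairs, and the fixed similarity limit are assumptions made for illustration. Skipping pairs with a similarity of 1 is likewise an assumption.

```python
from itertools import combinations


def sali_pairs(rows, similarity, limit=0.7):
    """Return (id1, id2, similarity, SALI) for comparable, similar row pairs.

    rows: list of dicts with 'id', 'group' and 'activity' (assumed layout).
    similarity: dict mapping frozenset({id1, id2}) to a value in [0, 1].
    """
    result = []
    for r1, r2 in combinations(rows, 2):
        s = similarity.get(frozenset((r1["id"], r2["id"])), 0.0)
        if s < limit or s >= 1.0:
            continue  # not similar enough, or SALI undefined (assumption)
        if r1["group"] != r2["group"]:
            continue  # values from different categories are not comparable
        value = abs(r1["activity"] - r2["activity"]) / (1.0 - s)
        result.append((r1["id"], r2["id"], s, value))
    return result


rows = [
    {"id": "A", "group": "CB1", "activity": 6.0},
    {"id": "B", "group": "CB1", "activity": 8.0},
    {"id": "C", "group": "CB2", "activity": 5.0},
    {"id": "D", "group": "CB2", "activity": 5.5},
]
similarity = {
    frozenset(("A", "B")): 0.75,
    frozenset(("A", "C")): 0.875,  # similar, but measured on different targets
    frozenset(("C", "D")): 0.875,
}
pairs = sali_pairs(rows, similarity)
```

    In this made-up example the pair A/C is skipped although the two molecules are similar, because their activities refer to different receptor subtypes. Only the pairs A/B and C/D receive a SALI value.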

    Similarity limit: Usually Automatic does a good job. However, if you prefer to get more or fewer neighbor relationships than the automatic process generates, you may disable the automatic setting and (moderately) adjust the slider that defines the threshold. If the limit is set too high, the 2D-scaling may find too few similarity relationships. The final map may then not differ much from the initial state of randomly scattered molecules. If the limit is set too low and therefore too many similarity pairs are found, then a highly interconnected bunch of molecules won't equilibrate well.

    Create view based on similarity relationships: If this option is checked, a similarity map of all molecules is created. For this purpose a Rubberbanding Forcefield is employed to incrementally equilibrate 2D-coordinates for all molecules until an energy minimum is reached and all molecules are positioned close to their most similar neighbors. Afterwards a 2D-view is created to visualize the similarity map.

    Create document of structure pairs: If this option is checked, then DataWarrior creates a new document in an open window, which contains all detected similarity relationships in dedicated rows. Two columns contain the two neighbor molecules; additional columns contain molecule identifiers, similarity, activities, and SALI values.

