## Analysing Data

DataWarrior supports you with various means in the task of exploring and comprehending large datasets, be it with or without chemical structures.

*Pivoting Data*

Consider a data table with three columns, of which the first contains product numbers,
the second column contains a year and the third the profits gained with this item in the given year.
Now imagine you intend to correlate profits between the different years. Then you need the data
table to be arranged differently, i.e. you would need rows to contain products, columns to contain
years and the individual cells to contain the profit values achieved. Column titles would
look like 'Product Number', '2010', '2011', '2012', ... This kind of data transformation technique
is called *Pivoting* or *Cross-Tabulation*. In *DataWarrior* this transformation is
started from the corresponding menu item, which lets you define the involved columns
in a dialog and then creates a new window with the copied and converted data table.
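To make the transformation concrete, here is a plain-Python sketch of cross-tabulation (an illustration only, not DataWarrior's implementation; the column names and data mirror the example above and are made up):

```python
# Hypothetical sketch of pivoting (cross-tabulation) in plain Python.
from collections import defaultdict

rows = [
    {"Product Number": "P-001", "Year": 2010, "Profit": 120.0},
    {"Product Number": "P-001", "Year": 2011, "Profit": 150.0},
    {"Product Number": "P-002", "Year": 2010, "Profit": 80.0},
    {"Product Number": "P-002", "Year": 2011, "Profit": 95.0},
]

def pivot(rows, group_col, split_col, value_col):
    """One row per group value; one new column per split value."""
    table = defaultdict(dict)
    for r in rows:
        table[r[group_col]][str(r[split_col])] = r[value_col]
    return [{group_col: g, **vals} for g, vals in table.items()]

pivoted = pivot(rows, "Product Number", "Year", "Profit")
# Each product now occupies one row with '2010', '2011', ... columns.
```

The grouping column becomes the row identity, while each distinct value of the splitting column spawns a new column holding the corresponding data value.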

The example used a file with Dopamine receptor antagonists from the ChEMBL database. Besides
various other information it contained chemical structures, measured effects and receptor subtype
(D1 - D5) in separate columns. Columns were selected to group results in one row per chemical structure
and to split the data into separate columns for every receptor subtype. Three data columns
were selected, the measured numerical *value*, the *unit* and the *m* column,
which sometimes contains a modifier (e.g. '<' symbols).

*Reverse Pivoting Data*

The *Reverse Pivoting* transformation is the opposite of the pivoting conversion.
It creates a new *DataWarrior* document from the original data table. For every original data row
it moves the cell content of multiple selected data columns into one new table column,
creating multiple new rows. A second column is created, which receives the column names
of the original data columns. This way every new row is assigned to one of the originally
selected data columns. Cell content of the other, non-selected columns is redundantly
copied to every new row.

The *Reverse Pivoting* dialog lets one select those columns that shall be merged into one
new column. One also needs to specify a name for this column as well as for the second new
column, which receives the original column names and serves to assign rows to their original categories.
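The steps above can be sketched in plain Python (an illustration only; all names are made up, not DataWarrior's API):

```python
# Hypothetical sketch of the reverse-pivot ("melt") transformation.
def reverse_pivot(rows, value_cols, value_name, category_name):
    """Merge value_cols into one column; a second column keeps the origin."""
    out = []
    for r in rows:
        kept = {k: v for k, v in r.items() if k not in value_cols}
        for col in value_cols:
            # one new row per selected column; other cells are copied
            out.append({**kept, category_name: col, value_name: r[col]})
    return out

rows = [{"Product": "P-001", "2010": 120.0, "2011": 150.0}]
melted = reverse_pivot(rows, ["2010", "2011"], "Profit", "Year")
# -> two rows, each carrying 'Product' plus one Year/Profit pair
```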

*Binning Numerical Data*

If the statistical distribution of all numerical values within one column is of concern,
and if a histogram shall be displayed, then one needs to create a new column from the
numerical values, which assigns a bin to each value, i.e. converts the numerical column
into a special kind of category column.

The corresponding menu item opens a dialog, where the bin size and the exact location
of the first bin can be defined, either by moving sliders or by typing in the desired
values. A histogram preview shows how
all rows are distributed among the bins with the current settings.
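A minimal sketch of this binning step in plain Python (the parameter names are assumptions for illustration, not DataWarrior's):

```python
# Assign each numerical value to a bin label, given a bin size and the
# start of the first bin; this turns numbers into category values.
import math

def assign_bins(values, bin_size, first_bin_start):
    labels = []
    for v in values:
        i = math.floor((v - first_bin_start) / bin_size)  # bin index
        lo = first_bin_start + i * bin_size
        labels.append(f"[{lo:g} - {lo + bin_size:g})")
    return labels

print(assign_bins([1.2, 3.7, 4.1], bin_size=2.0, first_bin_start=0.0))
# -> ['[0 - 2)', '[2 - 4)', '[4 - 6)']
```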

*Binning dialog and histogram view from created new column.*

*Calculating A New Column From An Expression*

Often some columns of a dataset contain alphanumerical data that, if combined, converted by a calculation, or transformed in some other way, would become available for filtering, view adaptation or other purposes. Examples are calculating the ratio between two columns, performing a conditional number conversion depending on an object's category, or calculating a fitness value from other properties.
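As a concrete illustration of the ratio example, here is a plain-Python sketch of deriving a new column with a conditional guard (illustrative only; the function name is made up):

```python
# Derive a new column as the ratio of two existing columns; rows with a
# zero denominator get an empty result instead of raising an error.
def ratio_column(rows, num_col, den_col):
    return [r[num_col] / r[den_col] if r[den_col] else None for r in rows]

rows = [{"a": 10.0, "b": 4.0}, {"a": 3.0, "b": 0.0}]
print(ratio_column(rows, "a", "b"))  # -> [2.5, None]
```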

For calculating a new column from existing column values, DataWarrior has the open source expression parser JEP built in. It lets you define an arithmetical, logical or textual expression from constants, operators (+, -, *, >, &, !, ==, etc.), functions such as sin(), sqrt(), if(), contains(), and variables that represent data from other columns. In addition to the standard functions, DataWarrior defines some special functions to calculate molecule similarities, ligand efficiencies, or occurrence frequencies of cell values. To define and evaluate a new expression, choose the corresponding menu item. When typing in a formula you don't need to type variable names. You should rather select the column name from the list in the dialog and click the respective button. This ensures correct spelling, especially since some column names are slightly modified to avoid syntax errors when the expression is parsed. The Help button opens a window explaining the expression syntax, all operators and available functions.

### The Gini Selectivity Score

The Gini Coefficient was originally invented to represent the income or wealth distribution of a nation's residents, and is today the most commonly used measure of inequality. For instance, in the context of medicinal chemistry P. Graczyk applied it to express the selectivity of kinase inhibitors (doi:10.1021/jm070562u). A Gini Coefficient of 0.0 expresses perfect equality, where all values are the same, while a value of 1.0 represents maximal inequality, e.g. a totally selective inhibitor that is only active on one of many kinases. Typically, the Gini Coefficient is used for a large number of positive numerical values of which a few are much larger than the rest. Thus, inhibition values are a perfect input, while logarithmic data is a less meaningful data source.

To calculate the Gini Coefficient in DataWarrior one needs to provide the input values in columns. The corresponding menu item opens a dialog that lets you define a set of columns, either by selecting them individually from a list or by specifying a common element of the columns' names. In addition, one may select an option that converts effect values to inhibition values by subtracting the original value from 100. DataWarrior then calculates a new column containing the Gini Coefficient using the trapezoid method as described in the paper by Graczyk. However, DataWarrior does not clip input values to the range from 0 to 100, because this would reduce the noise in an unsymmetrical fashion. We consider the Gini Coefficient robust enough to digest negative input data without sacrificing its usefulness.
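The following plain-Python sketch shows the general idea of a Gini coefficient computed by trapezoidal integration of the Lorenz curve (a simplified illustration of the method, not DataWarrior's exact code):

```python
# Gini coefficient via trapezoidal integration of the Lorenz curve.
def gini(values):
    v = sorted(values)
    n, total = len(v), sum(v)
    if n < 2 or total == 0:
        return 0.0
    area, cum = 0.0, 0.0
    for x in v:
        prev = cum
        cum += x / total                  # cumulative share of the total
        area += (prev + cum) / (2 * n)    # trapezoid strip of width 1/n
    return 1.0 - 2.0 * area

print(round(gini([1, 1, 1, 1]), 3))    # equal values -> 0.0
print(round(gini([0, 0, 0, 100]), 3))  # one active value -> 0.75
```

With many values and only one of them non-zero, the result approaches 1.0, matching the selective-inhibitor interpretation given above.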

### Defining a Fuzzy Score

Sometimes it is helpful to express the fitness or usefulness of an object with a single numerical value, which summarizes the fitness of multiple properties. For this purpose DataWarrior allows calculating a fuzzy score from multiple contributing property values. These values may be taken from numerical columns of the data set or they may be calculated on-the-fly from a chemical structure. No matter which individual properties are used, DataWarrior derives from every contributing property an individual score between 0.0 and 1.0. This is done using a smooth sigmoid curve defined for every property. In the dialog (see below) this curve is drawn in magenta color. The curve's slope and inflection point(s) can be adjusted by setting the respective values as well as by moving the slider. An additional slider allows reducing or increasing the contribution of the associated property's score to the overall score.

The individual valuation scores are then combined by calculating their mean value or by multiplying all individual scores (geometric mean). The latter method is usually chosen if one bad property alone can render an object useless.
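The sigmoid valuation and the two ways of combining scores can be sketched as follows (all midpoint/slope parameters are illustrative assumptions, not DataWarrior defaults):

```python
# Per-property sigmoid valuation combined into one fuzzy score.
import math

def property_score(value, midpoint, slope):
    """Smooth 0..1 valuation; a negative slope rewards low values."""
    return 1.0 / (1.0 + math.exp(-slope * (value - midpoint)))

def fuzzy_score(scores, use_geometric_mean=True):
    if use_geometric_mean:
        p = 1.0
        for s in scores:
            p *= s
        return p ** (1.0 / len(scores))  # one bad score drags everything down
    return sum(scores) / len(scores)     # arithmetic mean

# Reward a low IC50 (in nM) and a low molecular weight:
s1 = property_score(50.0, midpoint=100.0, slope=-0.05)
s2 = property_score(350.0, midpoint=450.0, slope=-0.02)
print(round(fuzzy_score([s1, s2]), 3))
```

With the geometric mean, a single near-zero property score pulls the combined score toward zero, which is exactly the "one bad property ruins it" behavior described above.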

The properties that contribute to the overall fuzzy score are compiled by adding property specific panels to the lower scrollable part of the dialog. This is done by selecting the property's name in the popup menu and pressing the respective button. A new panel then appears at the end of the property panel list, letting you define the valuation curve. If the property refers to the current data set, the distribution of the associated column's data is shown as a histogram in the curve area. This helps with defining proper valuation curves.

Defining a fuzzy score profile is often done once and then applied multiple times on the same, but updated data set or on different data sets. Therefore, this functionality is often used within a macro.

*Fuzzy score dialog defined to reward low IC50 values and low molecular weights.*

### Principal Component Analysis

The Principal Component Analysis
(PCA) is a widely used technique to convert a large number of
highly correlated variables into a smaller number of less correlated variables. In effect it performs
a coordinate transformation, positioning the first axis of the coordinate system such that it captures
the maximum possible variance of the multidimensional data. The second axis is then positioned orthogonal
to the first such that it captures the maximum possible variance of the remaining data.
The third axis is again positioned orthogonal to the first two, also maximizing the captured variance
of the remaining data, and so forth. To not overvalue those dimensions with the highest numerical
values, *DataWarrior* normalizes and centers the values of every input dimension before applying
the coordinate transformation.

Often the first dimensions (or components) of a PCA cover much of the variability of the data. This is because in reality many dimensions of multi-dimensional data are often highly correlated. In chemical data sets, for instance, carbon atom count, ring count, molecular surface, hetero atom count are all highly correlated with the molecular weight and therefore almost redundant information. Often the first two or three dimensions taken from a PCA can be used to position the objects on a plane or in space such that clusters of similar objects can easily be recognized in the view.
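The centering, normalization, and extraction of the first component described above can be sketched in plain Python using power iteration on the covariance matrix (a simplified illustration, not DataWarrior's implementation):

```python
# Minimal PCA sketch: center/normalize every column, build the covariance
# matrix, then find the first principal component by power iteration.
import math

def first_principal_component(rows):
    n, d = len(rows), len(rows[0])
    cols = list(zip(*rows))
    means = [sum(c) / n for c in cols]
    sds = [math.sqrt(sum((x - m) ** 2 for x in c) / n) or 1.0
           for c, m in zip(cols, means)]
    # center and normalize each dimension, as described above
    X = [[(r[j] - means[j]) / sds[j] for j in range(d)] for r in rows]
    cov = [[sum(X[i][a] * X[i][b] for i in range(n)) / n
            for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(100):                  # power iteration
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v  # unit vector pointing along the maximum variance

pc1 = first_principal_component([[1.0, 2.1], [2.0, 3.9], [3.0, 6.1], [4.0, 8.0]])
# both strongly correlated dimensions load roughly equally on the component
```

Projecting each row onto this vector yields the first PCA coordinate; repeating the procedure on the residual data yields the further components.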

Since many chemical descriptors are nothing other than binary or numerical vectors, one can consider the n dimensions of these vectors as n parameters of the data set. Therefore, the binary and the SkeletonSpheres descriptors can be selected as parameters for the PCA. This allows visualizing chemical space, if the first calculated components are assigned to the axes of a coordinate system. In the example above, a dataset of a few thousand Cannabinoid receptor antagonists from the ChEMBL database was used to calculate the first three principal components from the SkeletonSpheres descriptor. These were assigned to the axes of a 3D view. The dataset was clustered with the same descriptor, joining compounds until the cluster similarity reached 80% and forming more than 300 clusters. Marker colors were set to represent cluster numbers. The distinct areas of equal color are evidence of the chemical space separation that can be achieved by a PCA, even though the dataset consists of rather diverse structures.

### t-SNE Visualization

T-distributed Stochastic Neighbor Embedding (t-SNE) is a machine learning algorithm for visualization developed by Laurens van der Maaten and Geoffrey Hinton. It is a nonlinear dimensionality reduction technique well-suited for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.

L.J.P. van der Maaten. *Accelerating t-SNE using Tree-Based Algorithms.* Journal of Machine Learning Research **15** (2014), 3221-3245.

*Typical visualization of a 2-dimensional t-SNE plot with about 20,000 data points*

The t-SNE calculation used in DataWarrior is based on the Java code developed in 2016 by Leif Jonsson at Delft University of Technology. The source code is publicly available at https://github.com/lejon/T-SNE-Java.

By default, DataWarrior applies a principal component analysis to reduce the number of dimensions to 50 if the number of selected input dimensions is higher, e.g. if a chemical descriptor is chosen as input variable. DataWarrior uses the Barnes-Hut implementation with a default perplexity value of 20.0 and a maximum of 1000 iterations. The maximum number of input dimensions, the perplexity value and the number of iterations can be modified. A good overview of what these parameters mean and how they influence the outcome is available from Martin Wattenberg et al.

### Self-Organizing Maps

A Self-Organizing Map (SOM), also called
Kohonen Map,
is a popular and robust artificial neural network algorithm to organize objects based on object
similarity. Typically, this organization happens in 2-dimensional space. For this reason SOMs
can be used well to visualize the similarity relationships of hundreds or thousands of
objects of any kind. Historically, small SOMs (e.g. 10*10) were used in cheminformatics
to cluster compounds by assigning similar compounds to the same neuron. A typical visualization
showed the grid with every grid cell colored to represent the number of cluster members or an
average property value. In order to visualize individual compounds of the chemical space
*DataWarrior* uses many more neurons and adds a sub-neuron resolution explained in the
following.

Similar to the PCA, the SOM algorithm uses multiple numerical object properties,
e.g. of a molecule, which together define an object specific n-dimensional vector.
The vector may either consist of numerical column values
or - as with the PCA - it may be a chemical descriptor. These object describing vectors are called
input vectors, because their values are used to train a 2-dimensional grid of equally sized,
but independent reference vectors (also called neurons).

The goal of the training is to incrementally optimize the values of the reference vectors
such that eventually

- every input vector has at least one similar counterpart in the set of reference vectors.
- adjacent reference vectors in the rectangular grid are similar to each other.

*Cannabinoid receptor antagonists on a SOM with 50x50 neurons*

The Cannabinoid receptor antagonists, which have been used in the PCA example, were
arranged on a SOM considering the *SkeletonSpheres* descriptor as similarity criterion.
The background colors of the view visualize neighborhood similarity of adjacent neurons
in landscape inspired colors. Blue to green colors indicate valleys of similar neurons,
while yellow to orange areas show ridges of more abrupt changes of the chemical space
between adjacent neurons. Colored markers are located above the background landscape
on top of those neurons, which represent the compound descriptors best. This way very
similar compounds huddle in the same valleys, while slightly different clusters are separated
by yellow ridges.

Please note that *DataWarrior* uses sub-neuron resolution to position objects on
the final SOM. After assigning an object to the closest reference vector, any object's
position is fine-tuned by the object's similarity to the adjacent neurons.

Comparing the SOM to the PCA concerning their ability to visualize chemical space,
SOMs have these advantages: SOMs use all available space, while PCAs leave parts of the
available area or space empty. The SOM takes and translates exact similarity values,
while the PCA uses two or three principal components only to separate objects and,
therefore, may miss some distinction criteria. High compound similarity is translated
well into close topological neighborhood on the SOM. However, vicinity on the SOM
does not necessarily imply object similarity, because close neurons may be separated by
ridges and represent different areas of the chemical space, especially if the number
of reference vectors is low. One may also check how well individual objects match
the reference vectors they are assigned to. For this purpose *DataWarrior*
calculates for every row a *SOM_Fit* value, which is the square root of the
dissimilarity between the row's input vector and the reference vector the row is
assigned to. As a rule of thumb, *SOM_Fit* values below 0.5 are a good indication
of a well separating SOM.

The SOM algorithm is a rather simple and straightforward one. First the input vectors
are determined and normalized to avoid distortions of different numerical scales.
The number of values stored in each input vector equals the number of numerical columns
selected or the number of dimensions of the chosen descriptor.
Then a grid of n*n reference vectors is initialized with random numbers.
The reference vectors have the same dimensionality as the input vectors.

Then a couple of thousand times the following procedure is repeated:
An input vector is selected randomly and the reference vector most similar to it
is located. This reference vector and the reference vectors in
its circular neighborhood are modified to make them a little more similar to
the selected input vector. The amount of the modification decreases with increasing
distance from the central reference vector and it decreases with the number of
modification rounds already done. This way a coarse grid structure is formed quickly,
while the more local manifestation of fine-granular details takes the majority of the
optimization cycles.
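The procedure described above can be sketched in plain Python (a compressed illustration of the algorithm, not DataWarrior's code; learning rate, radius schedule, and the Gaussian-like neighborhood factor are illustrative assumptions):

```python
# SOM training loop: pick a random input vector, find the best-matching
# reference vector (neuron), and pull it and its grid neighbours toward
# the input with a strength that decays over distance and over time.
import math, random

def train_som(inputs, grid=8, rounds=2000, seed=1):
    rng = random.Random(seed)
    d = len(inputs[0])
    # grid of reference vectors, initialized with random numbers
    ref = [[[rng.random() for _ in range(d)] for _ in range(grid)]
           for _ in range(grid)]
    for t in range(rounds):
        x = inputs[rng.randrange(len(inputs))]   # random input vector
        # locate the best-matching neuron (smallest Euclidean distance)
        bx, by = min(((a, b) for a in range(grid) for b in range(grid)),
                     key=lambda ab: math.dist(ref[ab[0]][ab[1]], x))
        rate = 0.5 * (1.0 - t / rounds)          # decays with the rounds
        radius = max(1.0, grid / 2.0 * (1.0 - t / rounds))
        for a in range(grid):
            for b in range(grid):
                dist = math.hypot(a - bx, b - by)
                if dist <= radius:
                    f = rate * math.exp(-(dist / radius) ** 2)
                    for j in range(d):
                        ref[a][b][j] += f * (x[j] - ref[a][b][j])
    return ref
```

The shrinking radius reproduces the behavior described above: a coarse grid structure forms quickly, while fine-granular details manifest during the later, more local optimization cycles.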

To run the SOM algorithm in *DataWarrior*, select the corresponding item from the menu. The following dialog appears:

**Parameters used:** The columns selected here define the similarity criterion
between objects or table rows. All values are normalized by the variance of the column.
The normalized row values of all selected columns form the vector, which is used
to calculate a Euclidean similarity to other rows' vectors. If a chemical
descriptor is selected here, the SOM uses respective chemical similarities and,
thus, can be used to visualize chemical space.

**Neurons per axis:** This defines the size of the SOM. As a rule of thumb
the total number of neurons (the square of the chosen value) should roughly match the
number of rows. A highly diverse dataset will require some more neurons, while
a highly redundant combinatorial library will be fine with fewer.
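Applying this rule of thumb, a reasonable starting value can be estimated from the row count (a hypothetical helper for illustration):

```python
# Rule of thumb from the text: total neurons ~ number of rows,
# so each axis gets roughly the square root of the row count.
import math

def neurons_per_axis(row_count):
    return max(1, round(math.sqrt(row_count)))

print(neurons_per_axis(2500))  # -> 50
```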

**Neighbourhood function:** This selects the shape of the curve that defines
how the neuron modification factor decreases with distance from the
central neuron. A smoothly decaying shape usually produces smooth SOMs.

**Grow map during optimization:** Large SOMs take quite some time for the
algorithm to finish. One way of reducing the time is to start out with a much smaller
SOM and to double the size of each axis three times during the optimization. If
this option is checked, the map starts with one eighth of the defined neurons
per axis.

**Create unlimited map:** If this option is checked, then the left edge neurons
of the grid are considered to be connected to the respective right edge neurons
and top edge neurons are considered to be connected to bottom edge neurons.
This effectively converts the rectangular area to the surface of a torus,
which has no edges anymore and, therefore, avoids edge effects during the
optimization phase.
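The torus idea boils down to a wrap-around distance on the neuron grid, as this small helper illustrates (a sketch, not DataWarrior's code):

```python
# On an "unlimited" map the distance between two grid positions wraps
# around both axes, so opposite edges become direct neighbours.
import math

def torus_distance(a, b, size):
    dx = min(abs(a[0] - b[0]), size - abs(a[0] - b[0]))
    dy = min(abs(a[1] - b[1]), size - abs(a[1] - b[1]))
    return math.hypot(dx, dy)

# opposite edges of a 50x50 grid are direct neighbours on the torus:
print(torus_distance((0, 0), (49, 0), 50))  # -> 1.0
```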

**Fast best match finding:** A large part of the calculation time is spent
on finding the neuron most similar to a randomly picked input vector.
If this option is checked, then the best matching neurons are cached
for every input vector and are used as starting points in later
best-match searches. This assumes that a path with steadily increasing
similarity exists from the previous best matching neuron to the current one.

**Show vector similarity landscape in background:** After the
SOM algorithm finishes, *DataWarrior* creates a 2-dimensional view displaying
the objects arranged by similarity. If this option is checked, a background
picture is automatically generated, which shows the neighbourhood similarity
of the SOM's reference vectors in landscape inspired colors. Blue, cyan, green,
yellow, orange and red reflect an increasing dissimilarity between adjacent
neurons. Markers in the same blue or green valley belong to rather
similar objects.

**Use pivot table:** This option allows pivoting the data on the fly.
Imagine a dataset where gene expression values have been determined for a number
of genes in some different tissues. The results are stored in three columns
with gene names, expression values, and tissue types. Here you would select
*Tissue Type* as the column that groups rows and *Gene Name* as the column
whose values are split into separate columns.
*DataWarrior* would then convert the table layout to yield one
*Tissue Type* column and one additional column with expression values
for every gene name. The generated SOM would then show similarities between
tissue types considering expression values of multiple genes.

**Save file with SOM vectors:** Use this option to save all reference
vectors to a file once the optimization is done. Such a file can be used
later to map objects from a different dataset to the same map, provided
that they contain the properties or descriptor used to train the SOM.
One might, for instance, train a large SOM from many diverse compounds
containing both mutagenic and non-mutagenic structures. If one later used
the SOM file to map a commercial screening library, one could merge
both files and show toxic and non-toxic areas by coloring the background
of a 2D view accordingly. The foreground might show the compounds of the
screening library distributed in toxic and non-toxic regions.

Continue with Chemical Structures...