DataWarrior User Manual

Loading Data into DataWarrior


Apart from its own native file formats, DataWarrior also reads and writes TAB-delimited and comma-separated text files as well as SD-files, which are the de-facto industry standard for exchanging chemical information. In addition to reading data from files, data may be pasted from the clipboard or retrieved from databases. After reading data from any source, DataWarrior analyses every column to understand the kind of data it contains, i.e. whether it contains numerical and/or category data, whether the column contains empty values, and more. It also checks for correlations and creates default views and filters.
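As a rough illustration of the kind of per-column analysis described above, the following Python sketch classifies one column of text values. The function name `analyse_column` and the `max_categories` threshold are hypothetical and chosen for this example only; DataWarrior's actual analysis is not documented here and is likely more elaborate.

```python
def analyse_column(values, max_categories=10):
    """Summarise one column of raw text values (illustrative heuristic only)."""
    non_empty = [v.strip() for v in values if v is not None and v.strip() != ""]

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    distinct = set(non_empty)
    return {
        # At least one cell of the column is empty.
        "has_empty": len(non_empty) < len(values),
        # Every non-empty cell parses as a number.
        "is_numerical": bool(non_empty) and all(is_number(v) for v in non_empty),
        # Assumption: a small number of distinct values can serve as categories.
        "is_category": 0 < len(distinct) <= max_categories,
    }


if __name__ == "__main__":
    print(analyse_column(["1.5", "2", "", "3.25"]))
    print(analyse_column(["active", "inactive", "active"]))
```

Note that a column may be both numerical and categorical at the same time, which matches the "numerical and/or category data" wording above.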

If DataWarrior was installed correctly, then every file type discussed in this section should have a proper icon assigned, and double-clicking a file's icon should result in DataWarrior opening the file. This section explains the interaction with files and the clipboard.


Native DataWarrior Files

Whenever you save data from DataWarrior in order to open it later in the same application, a native DataWarrior file ending with .dwar is the preferred file type. In addition to the plain data, .dwar files may contain the following kinds of information:

  • Which views are visible, how they are arranged and what they display
  • Which filters are visible, how they are configured and, thus, which data rows are visible
  • Which row lists are defined and which rows belong to them
  • An HTML based text describing the file's content
  • Column related description of how to interpret a column's content, e.g. lookup URLs
  • Cell related detail data like formatted text and images
  • Keys and links for on-the-fly retrieval of cell related detail data from external sources
  • Hidden columns with molecule data such as descriptors and atom coordinates
  • Macros that allow complete automation of DataWarrior (from version 4.0)

To open a native DataWarrior file, choose Open from the File menu or just double-click an icon representing a .dwar file.


    DataWarrior Template Files

    A DataWarrior Template file contains the complete configuration of views and filters as they were when the Template file was saved. If you want to store the current state of views and filters of an open DataWarrior window in order to restore it later with the same or another dataset, you may save a Template file. To re-apply a previously stored template to an open DataWarrior window, choose Open Special -> Apply Template... from the File menu. You may then select either a .dwat or a .dwar file. In both cases the template will be read from the file and all views and filters will be replaced by new ones as defined in the file.


    DataWarrior Macro Files

    DataWarrior versions 4.0 and above support recording, editing and replaying entire workflows. These may be stored as part of a native DataWarrior file or can be exported into a dedicated macro file. Similar to templates, you may run a macro by opening a dedicated macro file with Open Special -> Run Macro... from the File menu.


    DataWarrior SOM Files

    By creating a self-organizing map (SOM), DataWarrior can position chemical molecules or other objects on a two-dimensional area in such a way that any object's closest neighbours in the plane are the objects most similar to it in the dataset. A calculated SOM is actually a 2-dimensional grid of reference vectors, each of which resembles one or more molecules/objects of the dataset. Once these reference vectors are calculated, the objects are assigned one by one to the reference vector that is most similar to the object. If one intends to map a second set of objects from an external file to a previously calculated SOM, then these vectors must be available. For that reason they can be saved as a SOM file, which can later be used to map external objects, effectively creating compatible 2-dimensional object coordinates.
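The assignment step described above, in which each object is mapped to its most similar reference vector, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the function names `euclidean` and `map_to_som` are hypothetical, plain Euclidean distance stands in for whatever similarity measure DataWarrior applies to its descriptors, and the tiny 2x2 grid is made-up data.

```python
import math


def euclidean(a, b):
    """Distance between two equally long vectors (assumed similarity measure)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def map_to_som(som_grid, obj_vector):
    """Return the (x, y) grid position of the closest reference vector."""
    best_pos, best_dist = None, float("inf")
    for y, row in enumerate(som_grid):
        for x, ref in enumerate(row):
            dist = euclidean(ref, obj_vector)
            if dist < best_dist:
                best_pos, best_dist = (x, y), dist
    return best_pos


# A previously calculated 2x2 grid of reference vectors (made-up values).
som = [
    [[0.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [1.0, 1.0]],
]

# Mapping an external object yields coordinates compatible with the saved grid.
print(map_to_som(som, [0.9, 0.2]))  # (1, 0)
```

Because the mapping only needs the stored reference vectors, a second dataset can be positioned on the same map without recalculating the SOM.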


    DataWarrior Query Files

    A .dwaq file or Query File does not contain any data. It rather contains a database query that is performed when the file is opened. Moreover, it may contain the template information needed to construct certain views and filter settings after the query result data has been retrieved. Query files are used if data in a database is frequently changing or to confidentially communicate new results, e.g. via e-mail. To open a .dwaq file, select Open Special -> Run Query... from the File menu, or double-click the icon representing the file.


    SD-Files

    SD-Files are the de-facto industry standard for exchanging chemical structures and associated alpha-numerical information. The format was developed and published by Molecular Design Ltd. (MDL). The version most widely used is version 2, which has limited support for stereochemistry: a so-called chiral flag defines for the entire molecule whether it is a single enantiomer or a racemate. With version 2 SD-files it is not possible to define epimers, mixtures of diastereomers, etc. To address these deficiencies, MDL introduced an updated concept along with an updated file format: version 3. DataWarrior consistently uses this new concept, which makes it possible to define for any stereo center within a molecule whether it is absolute or whether it belongs to a group of stereo centers with a specific relative stereo configuration.

    From the File menu, select Open... and use the dialog window to select the SD-file(s) (the file extension is .sdf) to import. DataWarrior reads the entire content of the SD-File, displays the rows in the Table View, creates default 2D- and 3D-Views and a Structure View, and generates a structure index (FragFp descriptor), which is needed internally for some structure-related tasks. While the indexing process is underway and its progress bar is visible in the status area, functions that depend on the index, e.g. sub-structure search, are not yet available.
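To make the layout of an SD-file more concrete, here is a minimal Python sketch that splits SD-file text into records and collects the data fields of each record. It relies only on the general conventions of the format (a structure block ending with `M  END`, data headers of the form `> <FieldName>`, and `$$$$` as record terminator). The function names `parse_sdf` and `_parse_record` are hypothetical, and this is not how DataWarrior itself reads the file.

```python
def _parse_record(lines):
    """Split one record into its structure block and a dict of data fields."""
    end = next((i for i, l in enumerate(lines) if l.startswith("M  END")),
               len(lines) - 1)
    molblock = "\n".join(lines[: end + 1])
    fields, name, value = {}, None, []
    for line in lines[end + 1:]:
        if line.startswith(">"):
            # Data header such as "> <MW>": the field name sits between < and >.
            start, stop = line.find("<"), line.rfind(">")
            name = line[start + 1: stop] if 0 <= start < stop else None
            value = []
        elif line.strip() == "":
            # A blank line closes the current data item.
            if name is not None:
                fields[name] = "\n".join(value)
                name = None
        elif name is not None:
            value.append(line)
    if name is not None:
        fields[name] = "\n".join(value)
    return molblock, fields


def parse_sdf(text):
    """Return a list of (molblock, fields) tuples, one per '$$$$' record."""
    records, current = [], []
    for line in text.splitlines():
        if line.strip() == "$$$$":
            records.append(_parse_record(current))
            current = []
        else:
            current.append(line)
    return records


# A one-record example file (methane) built line by line.
SAMPLE = "\n".join([
    "methane",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "> <Name>",
    "Methane",
    "",
    "> <MW>",
    "16.04",
    "",
    "$$$$",
])

for molblock, fields in parse_sdf(SAMPLE):
    print(fields)
```

Each data field would correspond to one column, and each record to one row, once the file is opened in a table.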


    Text Files

    TAB-delimited and comma-separated text files ('.txt' and '.csv') are among the most portable file formats because they can be created by many programs. In these text files each line represents a row and all fields within the row are separated by TABs or commas. If one or more columns of the text file contain chemical structures in SMILES format, DataWarrior automatically recognizes them and creates an additional column with chemical structures for every column containing SMILES. From the File menu, select Open Special and choose Textfile...
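As an example of how such a file might be produced by another program, the following Python sketch writes a small table with a SMILES column as TAB-delimited text and reads it back. The helper names `to_tab_delimited` and `read_delimited`, the column names, and the sample rows are assumptions made for this illustration.

```python
import csv
import io

# Sample rows; the "Smiles" column holds structures in SMILES format.
rows = [
    {"Name": "ethanol", "Smiles": "CCO", "MW": "46.07"},
    {"Name": "benzene", "Smiles": "c1ccccc1", "MW": "78.11"},
]


def to_tab_delimited(rows):
    """Serialise a list of dicts as TAB-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]),
                            delimiter="\t", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_delimited(text, delimiter):
    """Parse delimited text back into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


text = to_tab_delimited(rows)
print(text)
```

Writing `text` to a file with a '.txt' extension gives a file of the kind described above; passing `delimiter=","` to the writer would produce the comma-separated variant instead.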


    Example Data Files

    In the standard DataWarrior installation, the File menu contains two submenus with direct access to some example files. The option Open Reference File covers various files with chemical structures and related data, e.g. known drugs, pKa values, bioactive compounds, and other datasets of interest. Open Example File provides examples that illustrate non-chemistry related aspects of DataWarrior. Depending on the installation, further submenus may provide quick access to files in user defined directories.


    Paste Data from the Clipboard

    If you copy tabular data from any text editor or spreadsheet application, you may paste it directly into DataWarrior. This will open the data as if it were loaded from a text file. By analyzing the data, DataWarrior tries to determine whether a header row is present. If it concludes that there is none, it generates default column names.

    In most cases DataWarrior will correctly predict whether the clipboard content starts with a header row. If it fails because of insufficient clues, you may use one of the Paste Special options to indicate whether a header row is present.
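One conceivable way to guess whether pasted data starts with a header row is sketched below in Python. The heuristic (a header is assumed when the first row has no numeric cells while at least one column is numeric in all following rows), the function names, and the default names of the form "Column 1" are assumptions for illustration; the rule DataWarrior actually applies is not described here.

```python
def is_number(text):
    try:
        float(text)
        return True
    except ValueError:
        return False


def looks_like_header(rows):
    """Guess whether rows[0] is a header row (illustrative heuristic only)."""
    if len(rows) < 2:
        return False  # insufficient clues
    first, rest = rows[0], rows[1:]
    if any(is_number(cell) for cell in first):
        return False
    for col in range(len(first)):
        cells = [r[col] for r in rest if col < len(r) and r[col].strip()]
        if cells and all(is_number(c) for c in cells):
            return True
    return False


def with_column_names(rows):
    """Return (column names, data rows), generating default names if needed."""
    if looks_like_header(rows):
        return rows[0], rows[1:]
    names = [f"Column {i + 1}" for i in range(len(rows[0]))]
    return names, rows


clipboard = "Name\tMW\nethanol\t46.07\nbenzene\t78.11"
table = [line.split("\t") for line in clipboard.splitlines()]
print(with_column_names(table))
```

A table consisting only of text columns gives such a heuristic nothing to work with, which is the kind of situation where an explicit hint via Paste Special is needed.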

    In the following example some data was selected within a spreadsheet application and then copied to the clipboard with Ctrl-C.

    After switching to DataWarrior and choosing Paste (Ctrl-V), DataWarrior responds by displaying the clipboard's content in a new window. It has recognized that the column named "Smiles" contains valid SMILES codes and has automatically created an additional column with chemical structures from the SMILES strings. It also created two graphical default views and, since the data now contains chemical structures, a dedicated structure view.


    Importing Data From Databases

    Depending on your particular version, DataWarrior is able to retrieve data directly from a variety of databases. These include:

  • Compounds & Activities from the ChEMBL database
  • 3D-Structures from the Crystallography Open Database (COD)
  • All chemical structures from Wikipedia
  • Data from any Web-based data source in TAB or comma delimited format, e.g. Google spreadsheets

    At Actelion the following additional database options exist:

  • Chemical and biological data from the Osiris Database
  • Compounds and data from Actelion's Chemical Inventory
  • Compounds & products from the Commercial Chemicals Database
  • Compounds from the Screening Compounds Database
  • Protein Crystallization Data
  • Micro Array Data
  • Gene Expression Data
