DataWarrior User Manual

Accessing External Databases

DataWarrior, as its name implies, thrives on data. Since the amount of structured data freely available over the internet increases daily, DataWarrior has built-in functionality to access some of the most obvious data sources directly. We intend to make more data sources available with future software updates.


Wikipedia is undoubtedly one of the most important freely available sources of knowledge on the internet. Many thousands of its articles are about chemical substances. DataWarrior can retrieve an up-to-date list of all Wikipedia compound structures with their names and formulas into a local DataWarrior file. To do so, choose Retrieve Wikipedia Molecules from the Database menu. This offers an easy way to search or otherwise process the chemical content of Wikipedia offline. Within the local file every compound has a link back to the corresponding Wikipedia web page. A right mouse click within DataWarrior opens a popup menu from which the compound's Wikipedia article can be opened in the web browser.

The ChEMBL Activity Database

The ChEMBL database is one of the most widely used compound activity databases available. DataWarrior lets you query the database by specifying search criteria such as biological targets, PubMed IDs, or chemical structures. Structure searches may be based on chemical similarity, contained substructures, structure equivalence, or tautomers.

ChEMBL Database Query Dialog

The fields of the ChEMBL database query dialog are explained in more detail below:
Target Contains: This field quickly filters the target list by target name, gene accession number, organism, or other available target-related information. Multiple search phrases may be specified, for instance "renin rat".
Hierarchical Protein Family Filter: In the ChEMBL database many target proteins are assigned to a small protein family. Multiple small related protein families are grouped on a higher level into a less specific larger family. Related larger families are again grouped into an even larger one and so forth. This forms a hierarchical tree of related proteins. This filter narrows down the targets by first selecting a coarse protein family and then successively selecting subfamilies of the previously selected protein family.
Select Target(s): This text area contains a list of the targets that are available in the ChEMBL database and not filtered out by any filter criteria. If one or multiple targets of this list are selected, then the search will only retrieve activity values for these targets.
Target Detail: If one target is selected in the target list above, then this text area shows some detail about the target, e.g. the name, UniProt accession number, type, organism, and its protein family classification.
Structure search options: To run a structure search as part of the ChEMBL database search, either specify (draw, paste, or drag&drop) a structure in the Structure field of the dialog, or select one or more structures in a DataWarrior window before opening the ChEMBL search dialog. The type of structure search can be selected from a dedicated menu: sub-structure, exact match, non-stereo specific match, tautomer match, or structure similarity. Choosing structure similarity activates a slider to define the similarity limit. If some structures were pre-selected and the option any selected structure is chosen, then a database compound is considered a hit if its structure matches any of the selected structures.
Pubmed-ID(s) or DOI(s): Here one can specify one or multiple source papers separated by commas, semicolons, or spaces. For most papers the ChEMBL database contains PubMed IDs, while the DOI is often missing. Thus, searching for PubMed IDs is much more likely to yield useful results than searching for DOIs.
Group results with same compound, target, and result type: If this option is selected, then all results from the same target and the same chemical structure are merged into one result row. Within these rows individual result values appear on separate lines within one table cell.
To run a query one needs to specify at least one of the three kinds of search criteria: targets, one or more compound structures, or paper references. If your intention is to download the entire database, you should do that from the ChEMBL web site.
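
The filtering and grouping options described above can be illustrated with a small Python sketch. It is not DataWarrior's implementation: the record fields, the sample values, and the rule that every search phrase must match (rather than any) are assumptions made for illustration only.

```python
from collections import defaultdict

# Hypothetical target records; the field names are illustrative only.
TARGETS = [
    {"name": "Renin", "organism": "Rat",
     "family": ("Enzyme", "Protease", "Aspartic protease")},
    {"name": "Renin", "organism": "Human",
     "family": ("Enzyme", "Protease", "Aspartic protease")},
    {"name": "Thrombin", "organism": "Rat",
     "family": ("Enzyme", "Protease", "Serine protease")},
]


def matches_phrases(target, query):
    """Assumed rule: every space-separated phrase must occur in the target text."""
    text = " ".join([target["name"], target["organism"], *target["family"]]).lower()
    return all(phrase in text for phrase in query.lower().split())


def in_family(target, family_path):
    """True if the target lies below the given path in the family tree."""
    path = tuple(family_path)
    return target["family"][:len(path)] == path


def group_results(rows):
    """Merge rows sharing compound, target, and result type into one row.

    The individual values are joined with line breaks, mimicking several
    lines within one table cell.
    """
    groups = defaultdict(list)
    for row in rows:
        key = (row["compound"], row["target"], row["type"])
        groups[key].append(str(row["value"]))
    return [
        {"compound": c, "target": t, "type": rt, "values": "\n".join(vals)}
        for (c, t, rt), vals in groups.items()
    ]


# Made-up activity rows for demonstration purposes.
RESULTS = [
    {"compound": "CCO", "target": "Renin", "type": "IC50", "value": 12.0},
    {"compound": "CCO", "target": "Renin", "type": "IC50", "value": 15.0},
    {"compound": "CCO", "target": "Renin", "type": "Ki", "value": 3.0},
]

filtered = [t for t in TARGETS if matches_phrases(t, "renin rat")]
proteases = [t for t in TARGETS if in_family(t, ["Enzyme", "Protease"])]
grouped = group_results(RESULTS)
```

Representing the family classification as a path from the coarsest to the most specific family makes the successive narrowing of the hierarchical filter a simple prefix comparison.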

The Crystallography Open Database (COD)

The COD is probably the most comprehensive open database for crystallographic structures of small molecules. It is a valuable resource for studying conformational aspects of molecules such as typical torsion angles, bond lengths, and atom distances. The COD contains more than 360,000 organic and organometallic structures. DataWarrior uses conformational knowledge extracted from the COD for its conformation generation algorithm.

DataWarrior can run sub-structure queries on the COD, which return the matching 3D-structures along with metadata. The picture below shows a DataWarrior window with organic structures from the COD and a form view containing a large 3D-structure view.

COD database entry showing the torsion angle of a sulfonamide
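
For readers unfamiliar with the term, a torsion angle is the angle between the two planes defined by four consecutively bonded atoms. The following Python sketch computes it from 3D coordinates using the standard vector formula. The function name and the sample coordinates are made up for illustration; they are not taken from the COD or from DataWarrior.

```python
import math


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def torsion_angle(p0, p1, p2, p3):
    """Torsion angle in degrees for atoms p0-p1-p2-p3, in the range -180..180."""
    b0 = _sub(p0, p1)          # bond from the central atom p1 back to p0
    b1 = _sub(p2, p1)          # central bond, the rotation axis
    b2 = _sub(p3, p2)          # bond from p2 to p3
    norm = math.sqrt(_dot(b1, b1))
    b1 = tuple(c / norm for c in b1)
    # Project the outer bonds onto the plane perpendicular to the central bond.
    v = _sub(b0, tuple(_dot(b0, b1) * c for c in b1))
    w = _sub(b2, tuple(_dot(b2, b1) * c for c in b1))
    x = _dot(v, w)
    y = _dot(_cross(b1, v), w)
    return math.degrees(math.atan2(y, x))


# Made-up coordinates: the outer bonds are perpendicular when viewed
# along the central bond, so the result is close to 90 degrees.
angle = torsion_angle((1, 0, 0), (0, 0, 0), (0, 0, 1), (0, 1, 1))
```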

If you have measured or will create X-ray structures of small molecules, please consider uploading them to the COD. Open source software needs open data, and the well-known Cambridge Structural Database does not qualify as open data, since neither its structures nor any extracted information may be used as part of open source software.

Data From Custom URL

The Internet is full of web services that offer the retrieval of data tables of any kind, typically in TAB-delimited or similar formats. For instance, the Open Source Malaria Project maintains data in a Google spreadsheet, which can be accessed in TAB-delimited format via this URL:
One of its columns contains the chemical structures in SMILES format. DataWarrior's Database menu contains an item Retrieve Data From URL to retrieve data from such web resources. Whenever a column of such a web resource contains SMILES codes, DataWarrior recognizes them and creates an additional column showing the associated chemical structures. If one frequently retrieves updated data from the same URL, it may be a good idea to create a short macro with the URL retrieval task and save it in the macro folder within the DataWarrior installation folder. The Macro menu then contains an item that retrieves the data directly into a new DataWarrior window.
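
To make the idea concrete, here is a minimal Python sketch of reading a TAB-delimited table and locating a SMILES column. How DataWarrior actually recognizes SMILES codes is not described here, so the header-name check below is only a simple stand-in. The function names and the sample table are made up for illustration.

```python
import csv
import io
import urllib.request


def fetch_text(url):
    """Download a text resource; pass the URL of any TAB-delimited web resource."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read().decode("utf-8")


def read_tab_delimited(text):
    """Parse TAB-delimited text with a header line into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def find_smiles_columns(rows):
    """Assumed heuristic: a column holds SMILES if its header mentions 'smiles'."""
    if not rows:
        return []
    return [name for name in rows[0] if "smiles" in name.lower()]


# Made-up sample standing in for the text returned by fetch_text().
SAMPLE = "ID\tSMILES\tPotency\ncmpd-1\tCCO\t1.5\ncmpd-2\tc1ccccc1\t\n"

rows = read_tab_delimited(SAMPLE)
smiles_columns = find_smiles_columns(rows)
```

Keeping the download separate from the parsing lets the parsing step be tried on local text without network access.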

Accessing Relational Databases

The majority of all databases are relational databases, which can be accessed using a query language called Structured Query Language (SQL). SQL is used to define the tables and columns from which to retrieve data, to define the query conditions, and to specify how to logically join tables when information from multiple tables needs to be retrieved. To access these databases directly from DataWarrior one needs to specify the so-called connect-string or connection URL and the SQL command to execute.

SQL-Query to retrieve all reliable activity data from the ChEMBL database

The screenshot above shows the SQL-Query dialog configured to retrieve data from a ChEMBL database in MySQL format running on a server. In addition to MySQL, DataWarrior also supports Oracle databases. The SQL statement shown is a typical example. After the SELECT keyword it lists a few columns from various tables from which to retrieve data. The primary table activities, named after the FROM keyword, is logically linked to other tables with the JOIN keyword, and after the WHERE keyword some conditions are defined that specify and limit the data to be retrieved.

Note that one of the retrieval columns contains chemical structures in SMILES format. These are automatically recognized by DataWarrior and an additional column containing the corresponding chemical structures is generated.
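
The SELECT / FROM / JOIN / WHERE structure described above can be tried out with the following self-contained Python sketch. It is not the query from the screenshot. It uses Python's built-in SQLite module instead of MySQL or Oracle, a tiny toy schema loosely modeled on ChEMBL table names, made-up rows, and an assumed notion of "reliable" (a non-empty value and an assay confidence score of at least 8).

```python
import sqlite3

# An in-memory SQLite database stands in for a real MySQL or Oracle server.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE activities (activity_id INTEGER PRIMARY KEY, assay_id INTEGER,
                         molregno INTEGER, standard_type TEXT,
                         standard_value REAL, standard_units TEXT);
CREATE TABLE assays (assay_id INTEGER PRIMARY KEY, tid INTEGER,
                     confidence_score INTEGER);
CREATE TABLE target_dictionary (tid INTEGER PRIMARY KEY, pref_name TEXT);
CREATE TABLE compound_structures (molregno INTEGER PRIMARY KEY,
                                  canonical_smiles TEXT);

-- Made-up demonstration rows
INSERT INTO target_dictionary VALUES (1, 'Renin');
INSERT INTO assays VALUES (10, 1, 9), (11, 1, 4);
INSERT INTO compound_structures VALUES (100, 'CCO'), (101, 'c1ccccc1');
INSERT INTO activities VALUES
    (1, 10, 100, 'IC50', 12.0, 'nM'),
    (2, 11, 100, 'IC50', 50.0, 'nM'),
    (3, 10, 101, 'Ki', NULL, 'nM');
""")

QUERY = """
SELECT cs.canonical_smiles, td.pref_name,
       act.standard_type, act.standard_value, act.standard_units
FROM activities act
JOIN assays a ON a.assay_id = act.assay_id
JOIN target_dictionary td ON td.tid = a.tid
JOIN compound_structures cs ON cs.molregno = act.molregno
WHERE act.standard_value IS NOT NULL
  AND a.confidence_score >= ?
ORDER BY act.activity_id
"""

# The threshold of 8 is an assumption about what counts as reliable.
rows = conn.execute(QUERY, (8,)).fetchall()
for row in rows:
    print(row)
```

The first result column holds SMILES strings, which is the kind of column a structure-aware client would turn into a chemical structure column.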
