Downloadable Data Files

The DataWarrior installation already comes with various sample data files. These include FDA-approved drugs, compound collections with physico-chemical properties or measured pKa values, kinase ligands, and a few files with non-chemical content to illustrate various program features.

This page contains download links to larger data files that are not included in the DataWarrior installers, either because they would significantly increase the installer size or because they may not be of general interest.

  • Chemical reactions from US patents (1976-Sep2016)
  • X-Ray structures from the Crystallography Open Database
  • DrugBank 5.0.10

    Chemical reactions from US patents (1976-Sep2016)

    The original reaction collection was extracted by Daniel Lowe using text-mining from United States patents published between 1976 and September 2016. These reactions are available as CML or reaction SMILES. The original file with about 1.8 million reactions encoded as reaction SMILES can be downloaded, unzipped using a program like 7-Zip, and then directly opened with DataWarrior V5.0.0 or newer.

    The reactions were extracted using an enhanced version of the reaction extraction code described in https://www.repository.cam.ac.uk/handle/1810/244727 with LeadMine used for chemical entity recognition.

    General tips: Duplicate reactions are frequent because the same or highly similar text occurs in multiple patents. This is especially true when combining the applications and grants datasets, since many reactions from applications later appear in patent grants. Paragraph numbers are only present for patent grants and patent applications from 2005 onward. Multiple reactions can be extracted from the same paragraph. Atom maps in the reaction SMILES are derived using EPAM's Indigo toolkit. While typically correct, the atom maps are wrong in many cases and hence should not be relied on entirely.

    The reactions have been filtered to remove common cases of incorrectly extracted reactions: All product atoms must be accounted for by the atom mapping. The product(s) must have >8 heavy atoms. The product must not be charged if it is a single component. The number of products must be <5, and the number of reactants plus agents must be <16.
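    The filter criteria above can be sketched as a single predicate. This is only an illustration, not the code used to build the dataset: the `ProductSummary` and `ReactionSummary` types are hypothetical, and the sketch assumes that heavy-atom counts, unmapped-atom counts and formal charges have already been computed with a cheminformatics toolkit. It also assumes that the heavy-atom threshold applies to all products combined, which the text does not state explicitly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProductSummary:
    """Hypothetical pre-computed properties of one product molecule."""
    heavy_atoms: int      # number of non-hydrogen atoms
    unmapped_atoms: int   # product atoms without an atom-map number
    formal_charge: int    # net formal charge of the molecule


@dataclass
class ReactionSummary:
    """Hypothetical pre-computed summary of one extracted reaction."""
    n_reactants: int
    n_agents: int
    products: List[ProductSummary]


def passes_filters(rxn: ReactionSummary) -> bool:
    """Return True if the reaction satisfies all four criteria."""
    # 1. All product atoms must be accounted for by the atom mapping.
    if any(p.unmapped_atoms > 0 for p in rxn.products):
        return False
    # 2. The product(s) must have more than 8 heavy atoms
    #    (assumption: counted over all products together).
    if sum(p.heavy_atoms for p in rxn.products) <= 8:
        return False
    # 3. A single-component product must not be charged.
    if len(rxn.products) == 1 and rxn.products[0].formal_charge != 0:
        return False
    # 4. Fewer than 5 products and fewer than 16 reactants plus agents.
    if len(rxn.products) >= 5:
        return False
    if rxn.n_reactants + rxn.n_agents >= 16:
        return False
    return True


example = ReactionSummary(n_reactants=2, n_agents=1,
                          products=[ProductSummary(12, 0, 0)])
print(passes_filters(example))
```

    Keeping the criteria in one function with early returns makes it easy to see which rule rejects a given reaction when debugging.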

    The file includes columns PatentNumber, ParagraphNum, Year, TextMinedYield, and CalculatedYield.
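    For readers who want to process the original file outside DataWarrior, the following minimal sketch reads such a table with the Python standard library. Several details are assumptions and should be checked against the actual download: that the file is tab-separated with a header row, that the reaction SMILES column is named `ReactionSmiles`, and that yields are stored as text such as `85%`. The patent number in the sample row is a made-up placeholder. Splitting components on `.` is a simplification that ignores any fragment grouping.

```python
import csv
import io
import re
from typing import Iterator, List, Optional, TextIO, Tuple


def split_reaction_smiles(rxn_smiles: str) -> Tuple[List[str], List[str], List[str]]:
    """Split 'reactants>agents>products' into three lists of component SMILES."""
    # Drop anything after the first whitespace (e.g. a trailing extension block).
    stripped = rxn_smiles.strip()
    core = stripped.split(None, 1)[0] if stripped else ""
    parts = core.split(">")
    if len(parts) != 3:
        raise ValueError(f"expected two '>' separators, got: {rxn_smiles!r}")
    reactants, agents, products = ([s for s in part.split(".") if s] for part in parts)
    return reactants, agents, products


def parse_yield(text: str) -> Optional[float]:
    """Extract the first number from a yield string; None if there is none."""
    match = re.search(r"\d+(?:\.\d+)?", text or "")
    return float(match.group()) if match else None


def read_reactions(handle: TextIO) -> Iterator[dict]:
    """Yield one dictionary per row of a tab-separated reaction table."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        reactants, agents, products = split_reaction_smiles(row["ReactionSmiles"])
        yield {
            "patent": row["PatentNumber"],
            "year": int(row["Year"]),
            "text_mined_yield": parse_yield(row["TextMinedYield"]),
            "reactants": reactants,
            "agents": agents,
            "products": products,
        }


# Made-up sample row in the assumed layout (not taken from the dataset).
sample = (
    "ReactionSmiles\tPatentNumber\tParagraphNum\tYear\tTextMinedYield\tCalculatedYield\n"
    "CCO.CC(=O)O>OS(=O)(=O)O>CCOC(C)=O\tUS0000000\t\t1990\t85%\t\n"
)
for record in read_reactions(io.StringIO(sample)):
    print(record["year"], record["text_mined_yield"], record["products"])
```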

    Lit.: Lowe, Daniel (2017): Chemical reactions from US patents (1976-Sep2016). figshare. Fileset.

    The DataWarrior file that you can download from this website is a subset of the original file. It contains all original reactions with a text-mined yield between 50% and 150%. These reactions can be searched directly on your computer by reaction sub-structure (i.e. transformation search), by reaction similarity, or by a Retron search. The latter search type is a simple yet unusual concept, which was suggested by Roger Sayle: One draws a sub-structure that is intended to be synthesized. A sub-graph search then checks whether this sub-structure is found more often in the products than in the reactants. If so, the reaction is considered a match, because the sub-structure must have been built in the reaction. (File size: 172 MB; unzipped: 337 MB; 508,850 reactions)
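    The Retron criterion described above comes down to comparing two match counts. The sketch below shows only that comparison and is not DataWarrior's implementation. The function `is_retron_match` and the `count_matches` callback are hypothetical names, and the toy counter uses plain substring counting on SMILES strings purely as a stand-in for a real sub-graph search, which a cheminformatics toolkit would have to provide.

```python
from typing import Callable, Sequence

# A counter takes (query, molecule) and returns the number of matches.
CountFn = Callable[[str, str], int]


def is_retron_match(query: str,
                    reactants: Sequence[str],
                    products: Sequence[str],
                    count_matches: CountFn) -> bool:
    """True if the query occurs more often in the products than in the reactants."""
    in_products = sum(count_matches(query, mol) for mol in products)
    in_reactants = sum(count_matches(query, mol) for mol in reactants)
    return in_products > in_reactants


def toy_count(query: str, molecule: str) -> int:
    """Stand-in only: counts substrings, NOT chemical sub-structures."""
    return molecule.count(query)


# Toy reaction: ethyl bromide plus ammonia gives ethylamine.
reactants = ["CCBr", "N"]
products = ["CCN"]

# The C-N bond is formed in the reaction, so the query "CN" matches.
print(is_retron_match("CN", reactants, products, toy_count))
# The C-C bond is already present in the reactants, so "CC" does not match.
print(is_retron_match("CC", reactants, products, toy_count))
```

    Passing the counter in as a parameter keeps the comparison logic independent of whichever sub-structure matcher is actually used.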

    DataWarrior applying reaction similarity filter


    March 2018 Snapshot of the Crystallography Open Database (COD)

    Since version 4.2.2, DataWarrior has been able to generate conformers. The algorithm combines self-organization with a rule-based approach. The latter is based on statistical data derived from a large number of diverse, three-dimensional organic structures from a crystallographic database. The de facto standard source for crystallographic structures of organic molecules would be the Cambridge Structural Database (CSD). Its license, however, does not permit deriving and publishing geometrical statistical data as part of an open-source package. Luckily, there is an open alternative, the Crystallography Open Database (COD). While this database consists of one CIF file per structure, Saulius Grazulis and Antanas Vaitkus from the COD have built an automatic procedure, using Perl, Java and OpenChemLib, that converts the database into a format better suited for cheminformaticians. Here you may download a COD snapshot with 215,995 quality-checked 3D structures in DataWarrior format (112,670 organic, 94,913 metalorganic, 8,412 inorganic structures, 292.7 MByte, COD snapshot, March 24, 2018).

    DataWarrior displaying a COD entry


    DrugBank Version 5.0.10 (Subset in DataWarrior format)

    The DrugBank database is a unique bioinformatics and cheminformatics resource that combines detailed drug (i.e. chemical, pharmacological and pharmaceutical) data with comprehensive drug target (i.e. sequence, structure, and pathway) information. The database contains 8,250 drug entries, including 2,016 FDA-approved small molecule drugs, 229 FDA-approved biotech (protein/peptide) drugs, 94 nutraceuticals and over 6,000 experimental drugs. This DataWarrior file is a subset of DrugBank 5.0.10 downloaded from https://www.drugbank.ca.

    DrugBank is offered to the public as a freely available resource. Use and re-distribution of the data, in whole or in part, for commercial purposes (including internal use) requires a license. We ask that users who download significant portions of the database cite the DrugBank paper in any resulting publications.

    Citing DrugBank: Wishart DS, Knox C, Guo AC, Shrivastava S, Hassanali M, Stothard P, Chang Z, Woolsey J. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res. 2006 Jan 1;34(Database issue):D668-72. PMID: 16381955.

    DataWarrior showing general information about the Vitamin E entry of 'DrugBank'