The CHEMDNER corpus of chemicals and drugs and its annotation principles

Martin Krallinger*, Obdulia Rabal, Florian Leitner, Miguel Vazquez, David Salgado, Zhiyong Lu, Robert Leaman, Yanan Lu, Donghong Ji, Daniel M. Lowe, Roger A. Sayle, Riza Theresa Batista-Navarro, Rafal Rak, Torsten Huber, Tim Rocktäschel, Sérgio Matos, David Campos, Buzhou Tang, Hua Xu, Tsendsuren Munkhdalai, Keun Ho Ryu, S. V. Ramanan, Senthil Nathan, Slavko Žitnik, Marko Bajec, Lutz Weber, Matthias Irmer, Saber A. Akhondi, Jan A. Kors, Shuo Xu, Xin An, Utpal Kumar Sikdar, Asif Ekbal, Masaharu Yoshioka, Thaer M. Dieb, Miji Choi, Karin Verspoor, Madian Khabsa, C. Lee Giles, Hongfang Liu, Komandur Elayavilli Ravikumar, Andre Lamurias, Francisco M. Couto, Hong Jie Dai, Richard Tzong Han Tsai, Caglar Ata, Tolga Can, Anabel Usié, Rui Alves, Isabel Segura-Bedmar, Paloma Martínez, Julen Oyarzabal, Alfonso Valencia
*Corresponding author for this work
  • Spanish National Cancer Research Centre
  • University of Navarra
  • Technical University of Madrid
  • Assistance publique - Hôpitaux de Marseille
  • National Institutes of Health
  • Wuhan University
  • NextMove Software Limited
  • University of Manchester
  • Humboldt University of Berlin
  • University College London
  • University of Aveiro
  • Harbin Institute of Technology Shenzhen
  • University of Texas Health Science Center at Houston
  • Chungbuk National University
  • Indian Institute of Technology Madras
  • University of Ljubljana
  • OntoChem GmbH
  • Erasmus University Rotterdam
  • Institute of Scientific and Technical Information of China
  • Beijing Forestry University
  • Indian Institute of Technology Patna
  • Hokkaido University
  • University of Melbourne
  • CSIRO
  • Pennsylvania State University
  • Mayo Clinic Rochester, MN
  • University of Lisbon
  • Taipei Medical University
  • National Central University
  • Middle East Technical University
  • University of Lleida
  • Universidad Carlos III de Madrid

Research output: Contribution to journal > Article > peer-review

Abstract

The automatic extraction of chemical information from text requires the recognition of chemical entity mentions as one of its key steps. When developing supervised named entity recognition (NER) systems, the availability of a large, manually annotated text corpus is desirable. Furthermore, large corpora permit the robust evaluation and comparison of different approaches that detect chemicals in documents. We present the CHEMDNER corpus, a collection of 10,000 PubMed abstracts that contain a total of 84,355 chemical entity mentions labeled manually by expert chemistry literature curators, following annotation guidelines specifically defined for this task. The abstracts of the CHEMDNER corpus were selected to be representative of all major chemical disciplines. Each of the chemical entity mentions was manually labeled according to its structure-associated chemical entity mention (SACEM) class: abbreviation, family, formula, identifier, multiple, systematic and trivial. The difficulty and consistency of tagging chemicals in text were measured using an agreement study between annotators, which yielded a percentage agreement of 91%. For a subset of the CHEMDNER corpus (the test set of 3,000 abstracts) we provide not only the Gold Standard manual annotations, but also mentions automatically detected by the 26 teams that participated in the BioCreative IV CHEMDNER chemical mention recognition task. In addition, we release the CHEMDNER silver standard corpus of automatically extracted mentions from 17,000 randomly selected PubMed abstracts. A version of the CHEMDNER corpus in the BioC format has been generated as well. We propose a standard for required minimum information about entity annotations for the construction of domain-specific corpora on chemical and drug entities. The CHEMDNER corpus and annotation guidelines are available at: http://www.biocreative.org/resources/biocreative-iv/chemdner-corpus/.
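The inter-annotator agreement mentioned in the abstract can be illustrated with a small sketch. The abstract does not state how the percentage was computed, so the code below assumes exact span matching and a ratio of shared mentions to all mentions marked by either annotator. The `Mention` record, its field names, the `percentage_agreement` helper and the sample offsets are all hypothetical and do not reflect the corpus file format; only the seven SACEM class names come from the abstract.

```python
from typing import NamedTuple

# The seven SACEM classes named in the abstract.
SACEM_CLASSES = frozenset({
    "ABBREVIATION", "FAMILY", "FORMULA", "IDENTIFIER",
    "MULTIPLE", "SYSTEMATIC", "TRIVIAL",
})


class Mention(NamedTuple):
    """Hypothetical in-memory record for one annotated chemical mention."""
    doc_id: str  # illustrative document identifier
    start: int   # character offset where the mention begins
    end: int     # character offset just past the mention
    sacem: str   # one of SACEM_CLASSES


def percentage_agreement(a, b, use_class=False):
    """Shared mentions as a percentage of all mentions either annotator marked.

    Assumption: exact span matching and a Jaccard-style ratio. The abstract
    does not give the formula behind its reported figure.
    """
    for m in list(a) + list(b):
        if m.sacem not in SACEM_CLASSES:
            raise ValueError(f"unknown SACEM class: {m.sacem}")
    # Compare spans only, or spans together with the assigned class.
    key = (lambda m: m) if use_class else (lambda m: m[:3])
    set_a = {key(m) for m in a}
    set_b = {key(m) for m in b}
    union = set_a | set_b
    if not union:
        return 100.0  # neither annotator marked anything
    return 100.0 * len(set_a & set_b) / len(union)


# Made-up annotations for a single document.
annotator_1 = [
    Mention("doc1", 0, 7, "TRIVIAL"),
    Mention("doc1", 20, 24, "FORMULA"),
    Mention("doc1", 40, 43, "ABBREVIATION"),
]
annotator_2 = [
    Mention("doc1", 0, 7, "TRIVIAL"),
    Mention("doc1", 20, 24, "SYSTEMATIC"),
    Mention("doc1", 50, 58, "FAMILY"),
]

span_agreement = percentage_agreement(annotator_1, annotator_2)
class_agreement = percentage_agreement(annotator_1, annotator_2, use_class=True)
print(span_agreement, class_agreement)  # prints: 50.0 20.0
```

In this toy case the annotators share two of four distinct spans, and only one of five distinct span-plus-class pairs, which shows why agreement on mention boundaries and agreement on SACEM class are worth reporting separately.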

Original language: English
Article number: S2
Journal: Journal of Cheminformatics
Volume: 7
State: Published - 2015
Externally published: Yes
