BioInfer: a corpus for information extraction in the biomedical domain

Open Access

9 February 2007

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 8 (1), 50
https://doi.org/10.1186/1471-2105-8-50

Abstract

Recently, there has been great interest in applying information extraction methods to the biomedical domain, in particular to extracting relationships among genes, proteins, and RNA from scientific publications. The development and evaluation of such methods require annotated domain corpora. We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme that captures named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles, annotated for relationships, named entities, and syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation. The corpus targets protein, gene, and RNA relationships and serves as a resource for the development of information extraction systems and their components, such as parsers and domain analyzers. It will be maintained and further developed, with the current version available at http://www.it.utu.fi/BioInfer.

Cited by 269 articles