GENETAG: a tagged corpus for gene/protein named entity recognition

Open Access

24 May 2005

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 6 (S1), S3
https://doi.org/10.1186/1471-2105-6-s1-s3

Abstract

Named entity recognition (NER) is an important first step for text mining the biomedical literature. Evaluating the performance of biomedical NER systems is impossible without a standardized test corpus. The annotation of such a corpus for gene/protein name NER is a difficult process due to the complexity of gene/protein names. We describe the construction and annotation of GENETAG, a corpus of 20,000 MEDLINE® sentences for gene/protein NER. Of these, 15,000 sentences were used for the BioCreAtIvE Task 1A Competition.
