GENETAG: a tagged corpus for gene/protein named entity recognition
Open Access
- 24 May 2005
- journal article
- Published by Springer Nature in BMC Bioinformatics
- Vol. 6 (S1), S3
- https://doi.org/10.1186/1471-2105-6-s1-s3
Abstract
Named entity recognition (NER) is an important first step for text mining the biomedical literature. Evaluating the performance of biomedical NER systems is impossible without a standardized test corpus. The annotation of such a corpus for gene/protein name NER is a difficult process due to the complexity of gene/protein names. We describe the construction and annotation of GENETAG, a corpus of 20K MEDLINE® sentences for gene/protein NER. 15K GENETAG sentences were used for the BioCreAtIvE Task 1A Competition.