Automatic extraction of gene/protein biological functions from biomedical text

Open Access

27 October 2004

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 21 (7), 1227-1236
https://doi.org/10.1093/bioinformatics/bti084

Abstract

Motivation: With the rapid advancement of biomedical science and the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become critical. Because automatic functional annotation of genes is very useful for interpreting large amounts of high-throughput data efficiently, the demand for automatic extraction of gene function information from text has been increasing.

Results: We have developed a method for automatically extracting the biological process functions of genes/proteins/families, based on Gene Ontology (GO), from text using a shallow parser and sentence structure analysis techniques. When gene/protein/family names and their functions are described in ACTOR (doer of action) and OBJECT (receiver of action) relationships, the corresponding GO-IDs are assigned to the genes/proteins/families. The gene/protein/family names are recognized using the gene/protein/family name dictionaries developed by our group. To achieve wide recognition of gene/protein/family functions, we semi-automatically gather functional terms based on GO using co-occurrence, collocation similarities and rule-based techniques. A preliminary experiment demonstrated that our method has an estimated recall of 54–64% with a precision of 91–94% for functions actually described in abstracts. When applied to PubMed, it extracted over 190 000 gene–GO relationships and 150 000 family–GO relationships for major eukaryotes.

Availability: The extracted gene functions are available at http://prime.ontology.ims.u-tokyo.ac.jp

Contact: akoike@hgc.jp
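
The assignment rule described in the Results (a GO-ID is attached only when a recognized name and a functional term stand in an ACTOR/OBJECT relationship, not merely when they co-occur) can be illustrated with a toy sketch. This is not the authors' implementation: a regular expression stands in for the shallow parser and sentence structure analysis, and the names GENE_NAMES, FUNCTION_TERMS, LINK_VERBS and extract_gene_go_pairs, along with the tiny dictionaries, are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical miniature name dictionary (the paper uses dictionaries
# built by the authors' group; these entries are illustrative only).
GENE_NAMES = {"p53", "rad51", "cdc2"}

# Hypothetical functional-term dictionary mapping terms to GO biological
# process IDs (a small hand-picked sample, not the gathered term set).
FUNCTION_TERMS = {
    "apoptosis": "GO:0006915",
    "dna repair": "GO:0006281",
    "cell cycle": "GO:0007049",
}

# Assumed linking verbs that put the gene in the ACTOR slot and the
# functional term in the OBJECT slot.
LINK_VERBS = [
    "regulates", "induces", "mediates", "promotes",
    "is involved in", "participates in",
]


def _alternation(items):
    """Build a regex alternation, longest entries first so that
    multi-word entries are preferred over their prefixes."""
    ordered = sorted(items, key=len, reverse=True)
    return "|".join(re.escape(item) for item in ordered)


# ACTOR (gene) + linking verb + optional "the" + OBJECT (functional term).
# A regex is a crude stand-in for real shallow parsing.
PATTERN = re.compile(
    r"\b(?P<gene>%s)\b\s+(?:%s)\s+(?:the\s+)?(?P<term>%s)\b"
    % (_alternation(GENE_NAMES), _alternation(LINK_VERBS),
       _alternation(FUNCTION_TERMS)),
    re.IGNORECASE,
)


def extract_gene_go_pairs(sentence):
    """Return (gene, GO-ID) pairs for ACTOR/OBJECT matches in a sentence."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        go_id = FUNCTION_TERMS[match.group("term").lower()]
        pairs.append((match.group("gene"), go_id))
    return pairs


if __name__ == "__main__":
    for text in (
        "p53 induces apoptosis in tumor cells.",
        "Rad51 is involved in DNA repair.",
        "Apoptosis was observed in cells lacking p53.",  # co-occurrence only
    ):
        print(text, "->", extract_gene_go_pairs(text))
```

In this sketch the third sentence yields no pair, because the name and the term co-occur without the ACTOR/OBJECT pattern. Requiring the relationship rather than accepting any co-occurrence is the kind of constraint that favors precision over recall.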
