Automatic extraction of gene/protein biological functions from biomedical text
Open Access
- 27 October 2004
- journal article
- research article
- Published by Oxford University Press (OUP) in Bioinformatics
- Vol. 21 (7), 1227-1236
- https://doi.org/10.1093/bioinformatics/bti084
Abstract
Motivation: With the rapid advancement of biomedical science and the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become critical. Because automatic functional annotation of genes is useful for interpreting large amounts of high-throughput data efficiently, the demand for automatic extraction of gene function information from text has been increasing.
Results: We have developed a method for automatically extracting the biological process functions of genes/proteins/families, based on Gene Ontology (GO), from text using a shallow parser and sentence structure analysis techniques. When gene/protein/family names and their functions are described in ACTOR (doer of action) and OBJECT (receiver of action) relationships, the corresponding GO-IDs are assigned to the genes/proteins/families. The gene/protein/family names are recognized using the gene/protein/family name dictionaries developed by our group. To achieve wide recognition of gene/protein/family functions, we semi-automatically gather functional terms based on GO using co-occurrence, collocation similarities and rule-based techniques. A preliminary experiment demonstrated that our method has an estimated recall of 54–64% with a precision of 91–94% for functions actually described in abstracts. When applied to PubMed, it extracted over 190 000 gene–GO relationships and 150 000 family–GO relationships for major eukaryotes.
Availability: The extracted gene functions are available at http://prime.ontology.ims.u-tokyo.ac.jp
Contact: akoike@hgc.jp
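As a rough illustration of the ACTOR/OBJECT idea described in the abstract, the following minimal sketch assigns a GO-ID to a gene only when the gene name appears before an action verb (ACTOR side) and a functional term appears after it (OBJECT side). It is not the authors' implementation: the paper uses a shallow parser and large curated dictionaries, whereas this sketch substitutes a simple regular-expression split on a verb. The names `GENE_NAMES`, `FUNCTION_TERMS`, `ACTION_VERBS` and `extract_gene_go` are hypothetical, and the tiny dictionaries are assumptions made for the example.

```python
import re

# Toy dictionaries (assumptions for illustration; the paper's gene/protein/family
# name dictionaries and GO-based functional term lists are far larger).
GENE_NAMES = {"p53", "bcl-2", "cyclin d1"}
FUNCTION_TERMS = {
    "apoptosis": "GO:0006915",
    "cell proliferation": "GO:0008283",
}
# Verbs treated here as linking an ACTOR (before the verb) to an OBJECT (after it).
ACTION_VERBS = ("induces", "regulates", "promotes", "inhibits", "controls")


def extract_gene_go(sentence):
    """Return (gene, GO-ID) pairs for simple 'GENE ... VERB ... FUNCTION' sentences."""
    text = sentence.lower()
    pairs = []
    for verb in ACTION_VERBS:
        match = re.search(r"\b" + verb + r"\b", text)
        if not match:
            continue
        # Crude stand-in for shallow parsing: split the sentence at the verb.
        actor_side = text[:match.start()]
        object_side = text[match.end():]
        # Dictionary lookup for gene names, only on the ACTOR side.
        genes = [
            gene for gene in GENE_NAMES
            if re.search(r"(?<![\w-])" + re.escape(gene) + r"(?![\w-])", actor_side)
        ]
        # Dictionary lookup for functional terms, only on the OBJECT side.
        go_ids = [go for term, go in FUNCTION_TERMS.items() if term in object_side]
        pairs.extend((gene, go) for gene in sorted(genes) for go in sorted(go_ids))
    return pairs


if __name__ == "__main__":
    print(extract_gene_go("p53 induces apoptosis in tumor cells."))
    print(extract_gene_go("Apoptosis regulates p53 levels."))
```

The second call yields no pairs because the gene name sits on the OBJECT side of the verb. This mirrors, in a very simplified way, why requiring the ACTOR/OBJECT relationship favors precision over plain co-occurrence of a gene name and a functional term in the same sentence.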