Automatic extraction of gene/protein biological functions from biomedical text

Abstract
Motivation: With the rapid advancement of biomedical science and the development of high-throughput analysis methods, the extraction of various types of information from biomedical text has become critical. Because automatic functional annotation of genes helps interpret large amounts of high-throughput data efficiently, the demand for automatic extraction of gene function information from text has been increasing.

Results: We have developed a method for automatically extracting the biological process functions of genes/proteins/families, based on Gene Ontology (GO), from text using a shallow parser and sentence structure analysis techniques. When gene/protein/family names and their functions are described in ACTOR (doer of action) and OBJECT (receiver of action) relationships, the corresponding GO-IDs are assigned to the genes/proteins/families. The gene/protein/family names are recognized using the gene/protein/family name dictionaries developed by our group. To recognize a wide range of gene/protein/family functions, we semi-automatically gather functional terms based on GO using co-occurrence, collocation similarities and rule-based techniques. A preliminary experiment demonstrated that our method has an estimated recall of 54–64% with a precision of 91–94% for functions actually described in abstracts. When applied to PubMed, it extracted over 190 000 gene–GO relationships and 150 000 family–GO relationships for major eukaryotes.

Availability: The extracted gene functions are available at http://prime.ontology.ims.u-tokyo.ac.jp

Contact: akoike@hgc.jp
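
As a rough illustration of the extraction step described in the abstract above, the following minimal sketch assigns a GO-ID to a gene when the gene name appears as the ACTOR of an action verb and a known functional term appears as its OBJECT. It is not the authors' implementation: a simple regular expression stands in for the shallow parser and sentence structure analysis, only the active-voice pattern is covered, and the names `GENE_NAMES`, `FUNCTION_TERMS`, `ACTION_VERBS` and `extract_gene_go` are hypothetical. The tiny dictionaries and the example sentences are toy data chosen for illustration, not the dictionaries developed by the authors.

```python
import re

# Toy gene/protein name dictionary (assumption; the real dictionaries are far larger).
GENE_NAMES = {"p53", "BAX", "CDC6"}

# Toy functional-term dictionary mapping surface terms to GO-IDs (assumption).
FUNCTION_TERMS = {
    "apoptosis": "GO:0006915",
    "DNA replication": "GO:0006260",
}

# Small set of action verbs linking an ACTOR to an OBJECT (assumption).
ACTION_VERBS = ("regulates", "induces", "inhibits", "promotes")

# "<actor> <verb> <object>" followed by punctuation; a crude stand-in for parsing.
PATTERN = re.compile(
    r"(?P<actor>[\w-]+)\s+(?P<verb>" + "|".join(ACTION_VERBS) + r")"
    r"\s+(?P<object>[\w -]+?)[.,;]"
)


def extract_gene_go(sentence):
    """Return (gene, GO-ID) pairs where the gene is the ACTOR and the
    functional term is the OBJECT of one of the action verbs."""
    pairs = []
    for match in PATTERN.finditer(sentence):
        actor = match.group("actor")
        obj = match.group("object").strip()
        # Keep the pair only if both sides are found in the dictionaries.
        if actor in GENE_NAMES and obj in FUNCTION_TERMS:
            pairs.append((actor, FUNCTION_TERMS[obj]))
    return pairs


if __name__ == "__main__":
    print(extract_gene_go("CDC6 promotes DNA replication, and BAX induces apoptosis."))
```

Requiring both the ACTOR and the OBJECT to match a dictionary entry favours precision over recall in this sketch: a sentence that merely mentions a gene and a functional term together, without the verb pattern, yields no assignment.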