Automatic annotation for biological sequences by extraction of keywords from MEDLINE abstracts. Development of a prototype system.

1 January 1997

journal article
research article

Vol. 5, 25-32

Abstract

We have developed a prototype for the automatic annotation of functional characteristics in protein families. The system extracts biological information directly from the scientific literature, in the form of MEDLINE abstracts. The criterion for selecting relevant keywords is the difference between their frequency in the abstracts associated with the protein family under study and their frequency in the abstracts of other, unrelated protein families. The concept of functional information associated with protein families is the key feature of our system and brings evolutionary information into the problem of functional annotation of biological sequences. The system has been tested in two different scenarios: first, a large set of protein families with a small number of abstracts per family, and second, selected protein families with a large number of abstracts attached to each one. In both cases, the system's output is compared with annotations provided by human experts, showing a clear relation between the amount of information provided to the system and the quality of the annotations. The automatic annotations are in many cases of similar quality to those contained in current databases. We discuss the possibilities and the difficulties likely to be encountered in developing a full system for automatic annotation.
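
The keyword selection criterion described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact statistic, so this example assumes the simplest reading: a word's frequency is the fraction of abstracts in which it appears, and its score is the plain difference between the family frequency and the background frequency. The function names `word_frequencies` and `score_keywords`, the tokenization, and the `top` parameter are hypothetical and are not taken from the article.

```python
import re
from collections import Counter


def word_frequencies(abstracts):
    """Return, for each word, the fraction of abstracts that contain it."""
    counts = Counter()
    for text in abstracts:
        # Count each word once per abstract (assumed tokenization: runs of letters).
        counts.update(set(re.findall(r"[a-z]+", text.lower())))
    n = len(abstracts)
    return {word: c / n for word, c in counts.items()} if n else {}


def score_keywords(family_abstracts, background_abstracts, top=5):
    """Rank words by family frequency minus background frequency.

    Assumption: a plain difference of frequencies; the published system
    may use a different or normalized statistic.
    """
    family = word_frequencies(family_abstracts)
    background = word_frequencies(background_abstracts)
    scores = {w: f - background.get(w, 0.0) for w, f in family.items()}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top]
```

Under this assumption, a word that occurs in most abstracts of the family but rarely elsewhere scores close to 1, while common words that occur everywhere score near zero or below and are filtered out without an explicit stop-word list.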

Cited by 10 articles