IMPLICATIONS OF COMPOSITIONALITY IN THE GENE ONTOLOGY FOR ITS CURATION AND USAGE

Abstract
In this paper we argue that a richer underlying representational model for the Gene Ontology that captures the implicit compositional structure of GO terms could have a positive impact on two activities crucial to the success of GO: ontology curation and database annotation. We show that many of the new terms added to GO in a one-year span appear to be compositional variations of other terms. We found that 90.2% of the 3,652 new terms added between July 2003 and July 2004 exhibited characteristics of compositionality. We also examine annotations available from the GO Consortium website that are either manually curated or automatically generated. We found that 74.5% and 63.2% of GO terms are seldom, if ever, used in manual and automatic annotations, respectively. We show that there are features that tend to distinguish terms that are used from those that are not. In order to characterize the effect of compositionality on the combinatorial properties of GO, we employ finite state automata that represent sets of GO terms. This representational tool demonstrates how ontologies can grow very fast, and also shows that small conceptual changes can directly result in a large number of changes to the terminology. We argue that the curation and annotation findings we report are influenced by the combinatorial properties that present themselves in an ontology that does not have a model that properly captures the compositional structure of its terms.