Extraction of Transcript Diversity from Scientific Literature

Abstract
Transcript diversity generated by alternative splicing and associated mechanisms contributes heavily to the functional complexity of biological systems. The numerous examples of the mechanisms and functional implications of these events are scattered throughout the scientific literature. Thus, it is crucial to have a tool that can automatically extract the relevant facts and collect them in a knowledge base that can aid the interpretation of data from high-throughput methods. We have developed and applied a composite text-mining method for extracting information on transcript diversity from the entire MEDLINE database in order to create a database of genes with alternative transcripts. It contains information on tissue specificity, number of isoforms, causative mechanisms, functional implications, and experimental methods used for detection. We have mined this resource to identify 959 instances of tissue-specific splicing. Our results in combination with those from EST-based methods suggest that alternative splicing is the preferred mechanism for generating transcript diversity in the nervous system. We provide new annotations for 1,860 genes with the potential for generating transcript diversity. We assign the MeSH term “alternative splicing” to 1,536 additional abstracts in the MEDLINE database and suggest new MeSH terms for other events. We have successfully extracted information about transcript diversity and semiautomatically generated a database, LSAT, that can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression. LSAT (Literature Support for Alternative Transcripts) is publicly available at http://www.bork.embl.de/LSAT/.

Given the functional complexity of higher eukaryotes, the relatively small number of genes in the human and other mammalian genomes came as a surprise to the scientific community. Later it was discovered that the majority of genes are subject to alternative splicing (“cutting and pasting”) or associated mechanisms that ultimately increase the diversity of transcripts that code for proteins. Studies exploring transcript diversity are currently dominated by high-throughput experiments and computational methods; however, the quality of such data should be assessed against a reliable reference set based on single-gene studies. Unfortunately, the latter type of information is scattered throughout the scientific literature. The authors have thus developed a computational approach for extracting information on alternative transcripts from MEDLINE abstracts and used it to create a database, LSAT. LSAT (Literature Support for Alternative Transcripts) provides information for more than 4,000 genes from about 14,000 abstracts. This database can provide a quantitative understanding of the mechanisms behind tissue-specific gene expression based on single-gene studies, which we show agrees well with EST-based studies (studies in which tissue-specific splicing is detected by analyzing libraries of expressed sequence tags [ESTs]). These results indicate that mechanisms like alternative splicing, alternative promoters, and alternative polyadenylation work in concert to generate and regulate transcript diversity. More generally, information extraction of complex biological processes seems feasible and can also complement large-scale data generation in other areas to assign functions to genes.
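
To make the idea of mining abstracts for transcript-diversity mechanisms concrete, the following is a minimal keyword-matching sketch in Python. It is not the composite text-mining method used to build LSAT, whose details are not given here; the pattern table, the function names `tag_mechanisms` and `count_mechanisms`, and the example sentences are all hypothetical and chosen only for illustration. The three mechanism labels are the ones named in the text above.

```python
import re
from collections import Counter

# Hypothetical keyword patterns (an assumption for illustration only;
# these are NOT the rules used to build LSAT).
MECHANISM_PATTERNS = {
    "alternative splicing": re.compile(
        r"\balternative(?:ly)?[\s-]+splic(?:ing|ed)\b|\bsplice\s+variants?\b",
        re.IGNORECASE,
    ),
    "alternative promoter": re.compile(
        r"\balternative\s+promoters?\b", re.IGNORECASE
    ),
    "alternative polyadenylation": re.compile(
        r"\balternative\s+polyadenylation\b", re.IGNORECASE
    ),
}


def tag_mechanisms(abstract):
    """Return the set of mechanism labels whose pattern occurs in the text."""
    return {
        name
        for name, pattern in MECHANISM_PATTERNS.items()
        if pattern.search(abstract)
    }


def count_mechanisms(abstracts):
    """Count, per mechanism, how many abstracts mention it at least once."""
    counts = Counter()
    for text in abstracts:
        counts.update(tag_mechanisms(text))
    return counts


if __name__ == "__main__":
    # Made-up sentences, not real MEDLINE records.
    examples = [
        "The gene is alternatively spliced in brain.",
        "Two alternative promoters and alternative polyadenylation "
        "generate distinct isoforms.",
    ]
    for mechanism, n in sorted(count_mechanisms(examples).items()):
        print(mechanism, n)
```

A pure keyword match like this would miss paraphrases and would not link a mechanism to a specific gene or tissue, which is why a realistic pipeline would need additional steps such as entity recognition and manual curation.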