Recognizing exons in genomic sequence using GRAIL II.

  • 1 January 1994
    • journal article
    • Vol. 16, 241-53
Abstract
We describe an improved neural network system for recognizing protein coding regions (exons) in human genomic DNA sequences. This coding region recognition system is part of a new version of GRAIL, GRAIL II, and represents a significant improvement over the coding recognition performance of the previous GRAIL system. GRAIL II divides the process of locating exons into four steps. It first generates an exon candidate pool consisting of all possible (translation start-donor), (acceptor-donor), and (acceptor-translation stop) pairs within all open reading frames of the test sequence. The vast majority of these exon candidates are then eliminated from consideration by applying a set of heuristic rules. After reducing the size of the candidate pool, GRAIL II uses three trained neural networks to evaluate the coding potential and the accuracy of the edges of starting exon, internal exon, and terminal exon candidates. These networks output a set of overlapping candidates for each exon, which differ in their scores and in the positions of their edges. Multiple candidates for a given exon are grouped into a cluster based on their locations relative to candidates corresponding to other exons, and the highest scoring candidate in each cluster is used as the "best" prediction of the corresponding exon. Unlike the previous GRAIL version, GRAIL II uses variable-length windows to evaluate exon candidates, and its performance is nearly independent of exon length. In addition to several strong indicators of coding potential, the system uses several other types of information, including scores for splice junctions, GC composition, and the properties of the regions adjacent to an exon candidate, to aid in the discrimination process. On a large set of sequences from GenBank (3), GRAIL II located 93% of all exons regardless of size, with a false positive rate of 12%. Among the true positives, 62% match the actual exons exactly (the exon edges are correct to the base), and 93% match at least one edge correctly. These statistics, especially the false positive rate and the accuracy of the edges, are further improved through a process of gene model construction by the Gene Assembly Program (GAP III) (4) module of GRAIL II, which uses the scored exon candidates as input and constructs optimal gene models. The gene modeling system will be described elsewhere.
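
The final step, grouping overlapping candidates into clusters and keeping the highest scoring candidate of each cluster, can be illustrated with a minimal Python sketch. This is not the GRAIL II implementation. The abstract says only that clusters are formed from candidate locations relative to candidates for other exons, so the sketch assumes the simplest possible rule: candidates whose intervals overlap belong to the same cluster. The names `ExonCandidate`, `cluster_candidates`, and `best_per_cluster`, the coordinates, and the scores are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ExonCandidate:
    """A scored exon candidate (hypothetical structure, not GRAIL II's own)."""
    start: int    # 0-based start position, inclusive
    end: int      # end position, exclusive
    score: float  # assumed neural network output; higher means more exon-like


def cluster_candidates(candidates: List[ExonCandidate]) -> List[List[ExonCandidate]]:
    """Group candidates into clusters of mutually chained overlapping intervals.

    Assumption: overlap alone defines a cluster. The actual GRAIL II
    clustering rule is not specified in the abstract.
    """
    clusters: List[List[ExonCandidate]] = []
    cluster_end = 0  # rightmost end position seen in the current cluster
    for cand in sorted(candidates, key=lambda c: (c.start, c.end)):
        if clusters and cand.start < cluster_end:
            # Overlaps the current cluster: extend it.
            clusters[-1].append(cand)
            cluster_end = max(cluster_end, cand.end)
        else:
            # No overlap: start a new cluster.
            clusters.append([cand])
            cluster_end = cand.end
    return clusters


def best_per_cluster(candidates: List[ExonCandidate]) -> List[ExonCandidate]:
    """Return the highest scoring candidate of each cluster."""
    return [max(cluster, key=lambda c: c.score)
            for cluster in cluster_candidates(candidates)]


# Made-up candidates: three variants of one exon, two variants of another.
candidates = [
    ExonCandidate(100, 180, 0.71),
    ExonCandidate(103, 180, 0.92),
    ExonCandidate(100, 175, 0.40),
    ExonCandidate(400, 520, 0.66),
    ExonCandidate(410, 520, 0.58),
]

best = best_per_cluster(candidates)
print([(c.start, c.end) for c in best])  # prints [(103, 180), (400, 520)]
```

In this toy input the three candidates near position 100 differ only in their edges, which mirrors the abstract's description of overlapping candidates that differ in score and edge position; the sketch keeps the one with the highest score as the prediction for that exon.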