Natural Selection and Algorithmic Design of mRNA

Abstract
Messenger RNA (mRNA) sequences serve as templates for proteins according to the triplet code, in which each of the 43 = 64 different codons (sequences of three consecutive nucleotide bases) in RNA either terminate transcription or map to one of the 20 different amino acids (or residues) which build up proteins. Because there are more codons than residues, there is inherent redundancy in the coding. Certain residues (e.g., tryptophan) have only a single corresponding codon, while other residues (e.g., arginine) have as many as six corresponding codons. This freedom implies that the number of possible RNA sequences coding for a given protein grows exponentially in the length of the protein. Thus nature has wide latitude to select among mRNA sequences which are informationally equivalent, but structurally and energetically divergent. In this paper, we explore how nature takes advantage of this freedom and how to algorithmically design structures more energetically favorable than have been built through natural selection. In particular: (1) Natural Selection - we perform the first large-scale computational experiment comparing the stability of mRNA sequences from a variety of organisms to random synonymous sequences which respect the codon preferences of the organism. This experiment was conducted on over 27,000 sequences from 34 microbial species with 36 genomic structures. We provide evidence that in all genomic structures highly stable sequences are disproportionately abundant, and in 19 of 36 cases highly unstable sequences are disproportionately abundant. This suggests that the stability of mRNA sequences is subject to natural selection. (2) Artificial Selection - motivated by these biological results, we examine the algorithmic problem of designing the most stable and unstable mRNA sequences which code for a target protein. We give a polynomial-time dynamic programming solution to the most stable sequence problem (MSSP), which is asymptotically no more complex than secondary structure prediction. We show that the corresponding least stable sequence problem (LSSP) is NP-complete, and develop two heuristics for the construction of such sequences. We have implemented these algorithms, and present experimental results placing the high/low stability sequences in context with both wildtype and random encodings. Our implementation has already been applied to the design of RNA "code-words" creating little or no secondary structure in RNA computing (Brenneman and Condon, 2001; Marathe et al., 2001), and we anticipate a variety of other applications of this work to sequence design problems (Skiena, 2001).

This publication has 15 references indexed in Scilit: