Natural Selection and Algorithmic Design of mRNA

1 June 2003

journal article
research article
Published by Mary Ann Liebert Inc in Journal of Computational Biology

Vol. 10 (3-4), 419-432
https://doi.org/10.1089/10665270360688101

Abstract

Messenger RNA (mRNA) sequences serve as templates for proteins according to the triplet code, in which each of the 4³ = 64 different codons (sequences of three consecutive nucleotide bases) in RNA either terminate transcription or map to one of the 20 different amino acids (or residues) which build up proteins. Because there are more codons than residues, there is inherent redundancy in the coding. Certain residues (e.g., tryptophan) have only a single corresponding codon, while other residues (e.g., arginine) have as many as six corresponding codons. This freedom implies that the number of possible RNA sequences coding for a given protein grows exponentially in the length of the protein. Thus nature has wide latitude to select among mRNA sequences which are informationally equivalent, but structurally and energetically divergent. In this paper, we explore how nature takes advantage of this freedom and how to algorithmically design structures more energetically favorable than have been built through natural selection. In particular: (1) Natural Selection - we perform the first large-scale computational experiment comparing the stability of mRNA sequences from a variety of organisms to random synonymous sequences which respect the codon preferences of the organism. This experiment was conducted on over 27,000 sequences from 34 microbial species with 36 genomic structures. We provide evidence that in all genomic structures highly stable sequences are disproportionately abundant, and in 19 of 36 cases highly unstable sequences are disproportionately abundant. This suggests that the stability of mRNA sequences is subject to natural selection. (2) Artificial Selection - motivated by these biological results, we examine the algorithmic problem of designing the most stable and unstable mRNA sequences which code for a target protein. We give a polynomial-time dynamic programming solution to the most stable sequence problem (MSSP), which is asymptotically no more complex than secondary structure prediction. We show that the corresponding least stable sequence problem (LSSP) is NP-complete, and develop two heuristics for the construction of such sequences. We have implemented these algorithms, and present experimental results placing the high/low stability sequences in context with both wildtype and random encodings. Our implementation has already been applied to the design of RNA "code-words" creating little or no secondary structure in RNA computing (Brenneman and Condon, 2001; Marathe et al., 2001), and we anticipate a variety of other applications of this work to sequence design problems (Skiena, 2001).

Keywords

This publication has 15 references indexed in Scilit:

On Combinatorial DNA Word Design
Journal of Computational Biology, 2001
Design of multistable RNA molecules
RNA, 2001
Chargaff's legacy
Gene, 2000
Gradients in nucleotide and codon usage along Escherichia coli genes
Nucleic Acids Research, 2000
Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure
Journal of Molecular Biology, 1999
De Novo Protein Design: Fully Automated Sequence Selection
Science, 1997
Fast folding and comparison of RNA secondary structures
Monatshefte für Chemie / Chemical Monthly, 1994
Improved predictions of secondary structures for RNA.
Proceedings of the National Academy of Sciences, 1989
Attenuation in the Escherichia coli tryptophan operon: role of RNA secondary structure involving the tryptophan codon region.
Proceedings of the National Academy of Sciences, 1979
Normal Recurring Decimals
Journal of the London Mathematical Society, 1946

Cited by 26 articles