Correlations in DNA sequences: The role of protein coding segments

Abstract
Protein coding segments (exons) exhibit persistent correlations between their nucleotides with a pronounced period of three. We show that this periodicity, induced by nonuniform codon usage, implies long-range correlations over hundreds of base pairs once the length distribution of exons is taken into account. We derive expressions that relate the length distribution of exons to the decay of the correlations and find agreement with numerical simulations. Finally, we analyze the decay of the mutual information function in yeast chromosomes, in an E. coli chromosome region, and in myosin heavy chain genes as representative examples. In these cases, most of the long-range statistical dependences can be explained, even quantitatively.
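
As an illustration of the quantity analyzed above, the following is a minimal sketch of a plug-in estimate of the mutual information function I(k) = sum over a, b of p_ab(k) log2[p_ab(k) / (p_a p_b)], where p_ab(k) is the probability of finding nucleotide a and, k positions downstream, nucleotide b. The function names, the toy sequence generator, and the position-specific base weights are assumptions made for illustration only; they are not taken from the paper, and the sketch does not reproduce its derivation or data.

```python
import math
import random
from collections import Counter


def mutual_information(seq, k):
    """Plug-in estimate of I(k), in bits, for symbols k positions apart."""
    n = len(seq) - k  # number of (i, i + k) pairs
    if k < 1 or n < 1:
        raise ValueError("need 1 <= k < len(seq)")
    pairs = Counter(zip(seq, seq[k:]))  # joint counts n_ab(k)
    left = Counter(seq[:n])             # marginal counts of the first symbol
    right = Counter(seq[k:])            # marginal counts of the second symbol
    info = 0.0
    for (a, b), n_ab in pairs.items():
        # p_ab / (p_a * p_b) = n_ab * n / (n_a * n_b)
        info += (n_ab / n) * math.log2(n_ab * n / (left[a] * right[b]))
    return info


def toy_coding_sequence(n_codons, seed=0):
    """Independent draws with hypothetical codon-position-specific weights."""
    rng = random.Random(seed)
    weights = [
        (0.40, 0.20, 0.30, 0.10),  # codon position 1 (A, C, G, T), illustrative
        (0.10, 0.40, 0.20, 0.30),  # codon position 2, illustrative
        (0.25, 0.25, 0.25, 0.25),  # codon position 3, uniform
    ]
    return "".join(
        rng.choices("ACGT", weights=weights[i % 3])[0]
        for i in range(3 * n_codons)
    )


if __name__ == "__main__":
    seq = toy_coding_sequence(10_000)
    for k in range(1, 10):
        print(k, round(mutual_information(seq, k), 5))
```

In this toy model the nucleotides are drawn independently, so any dependence between positions comes only from the codon-position-specific composition. Lags that are multiples of three pair positions with the same codon phase and should therefore give larger estimates than neighboring lags. Because the marginals are taken from the same pairs as the joint counts, the estimate is non-negative by construction; it still carries a small positive finite-sample bias.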