Bayesian analysis of amino acid substitution models

7 October 2008

journal article
Published by The Royal Society in Philosophical Transactions of the Royal Society B: Biological Sciences

Vol. 363 (1512), 3941-3953
https://doi.org/10.1098/rstb.2008.0175

Abstract

Models of amino acid substitution present challenges beyond those often faced with the analysis of DNA sequences. The alignments of amino acid sequences are often small, whereas the number of parameters to be estimated is potentially large when compared with the number of free parameters for nucleotide substitution models. Most approaches to the analysis of amino acid alignments have focused on the use of fixed amino acid models in which all of the potentially free parameters are fixed to values estimated from a large number of sequences. Often, these fixed amino acid models are specific to a gene or taxonomic group (e.g. the Mtmam model, which has parameters that are specific to mammalian mitochondrial gene sequences). Although the fixed amino acid models succeed in reducing the number of free parameters to be estimated (indeed, they reduce it from approximately 200 to 0), it is possible that none of the currently available fixed amino acid models is appropriate for a specific alignment. Here, we present four approaches to the analysis of amino acid sequences. First, we explore the use of a general time reversible model of amino acid substitution using a Dirichlet prior probability distribution on the 190 exchangeability parameters. Second, we explore the behaviour of prior probability distributions that are ‘centred’ on the rates specified by a fixed amino acid model. Third, we consider a mixture of fixed amino acid models. Finally, we consider constraints on the exchangeability parameters as partitions, similar to how nucleotide substitution models are specified, and place a Dirichlet process prior model on all the possible partitioning schemes.
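To make the parameter counts in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It draws the 190 exchangeability parameters and the 20 stationary frequencies from flat Dirichlet priors, assembles a general time reversible rate matrix with off-diagonal entries q_ij = r_ij * pi_j, and draws one partition of the 190 exchangeabilities from a Chinese restaurant process, the distribution over partitions induced by a Dirichlet process. The function names, the flat prior parameters, the concentration value of 1.0 and the rescaling to one expected substitution per unit time are all assumptions made for illustration. Under these sum-to-one constraints the model has 189 + 19 = 208 free parameters, in line with the "approximately 200" quoted above.

```python
import random
from itertools import combinations

N_STATES = 20  # number of amino acids


def sample_dirichlet(alpha, rng):
    """Draw one Dirichlet(alpha) sample by normalising gamma variates."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def build_rate_matrix(exchange, freqs):
    """Assemble a reversible rate matrix with q_ij = r_ij * pi_j (i != j)."""
    n = len(freqs)
    q = [[0.0] * n for _ in range(n)]
    for (i, j), r in zip(combinations(range(n), 2), exchange):
        q[i][j] = r * freqs[j]
        q[j][i] = r * freqs[i]
    for i in range(n):
        q[i][i] = -sum(q[i])  # diagonal is still 0.0 here, so rows sum to zero
    # Assumed convention: rescale to one expected substitution per unit time.
    scale = -sum(freqs[i] * q[i][i] for i in range(n))
    return [[x / scale for x in row] for row in q]


def sample_partition_crp(n_items, concentration, rng):
    """Draw a partition of n_items from a Chinese restaurant process."""
    assignment, counts = [], []
    for _ in range(n_items):
        # Join an existing class in proportion to its size, or open a new one
        # in proportion to the concentration parameter.
        weights = counts + [concentration]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)
        else:
            counts[k] += 1
        assignment.append(k)
    return assignment


rng = random.Random(1)
n_pairs = N_STATES * (N_STATES - 1) // 2               # 190 exchangeabilities
exchange = sample_dirichlet([1.0] * n_pairs, rng)      # flat Dirichlet prior draw
freqs = sample_dirichlet([1.0] * N_STATES, rng)        # stationary frequencies
Q = build_rate_matrix(exchange, freqs)
partition = sample_partition_crp(n_pairs, 1.0, rng)    # rate classes for the 190 pairs
free_parameters = (n_pairs - 1) + (N_STATES - 1)       # 189 + 19
```

Exchangeabilities that fall in the same class of `partition` would share a single rate, which is how a partition constrains the model in the same way that nucleotide models such as HKY85 tie transition and transversion rates together.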

This publication has 39 references indexed in Scilit.

Cited by 42 articles