Multidimensional Vector Space Representation for Convergent Evolution and Molecular Phylogeny

Abstract
With growing amounts of genome data and constant improvement of models of molecular evolution, phylogenetic reconstruction became more reliable. However, our knowledge of the real process of molecular evolution is still limited. When enough large-sized data sets are analyzed, any subtle biases in statistical models can support incorrect topologies significantly because of the high signal-to-noise ratio. We propose a procedure to locate sequences in a multidimensional vector space (MVS), in which the geometry of the space is uniquely determined in such a way that the vectors of sequence evolution are orthogonal among different branches. In this paper, the MVS approach is developed to detect and remove biases in models of molecular evolution caused by unrecognized convergent evolution among lineages or unexpected patterns of substitutions. Biases in the estimated pairwise distances are identified as deviations (outliers) of sequence spatial vectors from the expected orthogonality. Modifications to the estimated distances are made by minimizing an index to quantify the deviations. In this way, it becomes possible to reconstruct the phylogenetic tree, taking account of possible biases in the model of molecular evolution. The efficacy of the modification procedure was verified by simulating evolution on various topologies with rate heterogeneity and convergent change. The phylogeny of placental mammals in previous analyses of large data sets has varied according to the genes being analyzed. Systematic deviations caused by convergent evolution were detected by our procedure in all representative data sets and were found to strongly affect the tree structure. However, the bias correction yielded a consistent topology among data sets. The existence of strong biases was validated by examining the sites of convergent evolution between the hedgehog and other species in mitochondrial data set. This convergent evolution explains why it has been difficult to determine the phylogenetic placement of the hedgehog in previous studies.