Maximum Likelihood Estimates of Species Trees: How Accuracy of Phylogenetic Inference Depends upon the Divergence History and Sampling Design

Abstract
The understanding that gene trees are often in discord with each other and with the species trees that contain them has led researchers to methods that incorporate the inherent stochasticity of genetic processes in the phylogenetic estimation procedure. Recently developed methods for species-tree estimation that not only consider the retention and sorting of ancestral polymorphism but also quantify the actual probabilities of incomplete lineage sorting are expected to provide an improvement over earlier summary-statistic based approaches that discard much of the information content of gene trees. However, these new methods have yet to be tested on truly challenging evolutionary histories such as those marked by recent rapid speciation where high levels of incomplete lineage sorting and discord among gene trees predominate. Here, we test a new maximum-likelihood method that incorporates stochastic models of both nucleotide substitution and lineage sorting for species-tree estimation. Using a simulation approach, we consider a broad range of species-tree topologies under 2 scenarios representing moderate and severe incomplete lineage sorting. We show that the maximum-likelihood method results in more accurate species trees than a summary-statistic based approach, demonstrating that information contained in discordant gene trees can be effectively extracted using a full probabilistic model. Moreover, we demonstrate that the shape of the original species tree (i.e., the relative lengths of internal branches) has a significant impact on whether the species tree is estimated accurately. In the speciation histories explored here, it is not just the recent origin of species that affects the accuracy of the estimates but the variance in relative species divergence times as well. Additionally, we show that sampling effort (number of individuals and/or loci) and sampling design (ratio of individuals to loci) are both important factors affecting the accuracy of species-tree estimates, which is again affected by the relative timing of divergence among species. The inherent difficulties of estimating relationships when species have undergone a recent radiation are discussed, and in particular, the limitations with maximum-likelihood estimates of species trees that do not consider uncertainty in the estimated gene trees of individual loci. Thus, despite substantial improvements over current summary-statistic based approaches, and the increased sophistication of procedures that incorporate the process of gene lineage coalescence, recent radiations still appear to pose daunting challenges for phylogenetics