Confidence in evolutionary trees from biological sequence data

Abstract
The reliable construction of evolutionary trees from nucleotide sequences often depends on randomization tests such as the bootstrap [1] and PTP (cladistic permutation tail probability) tests [2–6]. The genomes of bacteria [7], viruses [8], animals [7,9,10] and plants [11], however, vary widely in their nucleotide frequencies. Where genomes have independently acquired similar G+C base compositions, signals arise in the data that cause methods of evolutionary tree reconstruction to estimate the wrong tree by grouping together sequences with similar G+C content [12–14]. Under these conditions, randomization tests can lead both to rejection of the correct evolutionary hypothesis and to acceptance of an incorrect one (as with the contradictory inferences from the photosynthetic rbcS and rbcL sequences [14]). We have proposed one approach to testing for the G+C content problem [15]. Here we present a formalization of this method, a frequency-dependent significance test, which has general application.