Abstract
The problem of discovering the most parsimonious tree is defined in terms of a set of linearly arrayed sequences. Simplifications are introduced to reduce the total amount of work, including the elimination of uninformative positions and the recognition of equivalent positions. The procedure can be applied to any array of sequences, including amino acid sequences. Failure to convert such sequences, through the genetic code, into nucleotide sequences is very wasteful of pertinent information. Parsimony is a procedure that minimizes discordancies (parallel and/or back substitutions). A procedure (a discordancy diagram) is given that enables one to recognize when 2 characters (nucleotide positions) will necessitate the acceptance of such discordancies and the minimum number that will be unavoidable. Subtraction of these unavoidable discordancies from a matrix of potential discordancies yields a matrix of avoidable discordancies that generally gives at least 2 pairs of taxa that are most closely related parsimoniously (i.e., have 0 avoidable discordancies); each such pair may be replaced by an ancestral form determined by the parsimony process. The parsimony process is also given. The process may be repeated until the tree is completed. A method is given for determining a lower bound on the number of substitutions required; it yields a much larger lower bound than previous estimates. A quick estimate of the upper bound is also provided. An alternative approach, using a Prim-Kruskal network (minimal spanning tree) on the avoidable discordancy distances, is given, together with a procedure for interpreting such networks in terms of a phylogeny that appears more natural than the dendrograms usually employed in the interpretation of single-linkage diagrams. The reduction of a strictly bifurcating tree to a Prim-Kruskal network is called compression, and its reverse is called decompression; the original tree is recovered after a complete cycle of the 2 processes.
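The two simplifications named above (dropping uninformative positions and recognizing equivalent positions) can be illustrated with a minimal Python sketch. The abstract does not give the exact definitions, so the sketch assumes the usual parsimony criterion that a position is informative only if at least two states each occur in at least two sequences, and it assumes two positions are equivalent when they split the taxa into the same groups. The function names and the example sequences are hypothetical.

```python
from collections import Counter, defaultdict


def is_informative(column):
    """Assumed criterion: at least two states each occur in >= 2 sequences."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def partition_pattern(column):
    """Relabel states by order of first appearance, so that columns that
    group the taxa identically produce the same tuple."""
    labels = {}
    return tuple(labels.setdefault(state, len(labels)) for state in column)


def simplify(sequences):
    """Map each distinct informative pattern to the positions that share it."""
    groups = defaultdict(list)
    for pos, column in enumerate(zip(*sequences)):
        if is_informative(column):
            groups[partition_pattern(column)].append(pos)
    return dict(groups)


# Hypothetical aligned sequences, one per taxon.
sequences = ["AAGCT", "AAGTT", "GCACT", "GCATA"]
print(simplify(sequences))
# {(0, 0, 1, 1): [0, 1, 2], (0, 1, 0, 1): [3]}
```

Under these assumptions, positions 0, 1 and 2 need to be analyzed only once and weighted by 3, and position 4 (a single odd state) is dropped because it cannot favor one tree over another.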
There is a unique 1-to-1 correspondence between any phylogenetic tree and its compressed network form. The compressed form can be the basis for an unambiguous linear representation of a tree that is more compact than that of any other known method of representation.
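The Prim-Kruskal network referred to in the abstract is a minimal spanning tree. As a rough illustration only, the following Python sketch builds one with Prim's algorithm from a symmetric distance matrix. The matrix values are invented for the example and are not avoidable discordancy distances from the paper, and the sketch covers neither the paper's compression and decompression steps nor its phylogenetic interpretation of the network.

```python
def prim_mst(dist):
    """Return minimal-spanning-tree edges (i, j, weight) for a symmetric
    distance matrix, growing the tree outward from taxon 0."""
    n = len(dist)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge joining the tree to a taxon outside it.
        weight, i, j = min(
            (dist[i][j], i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        edges.append((i, j, weight))
        in_tree.add(j)
    return edges


# Hypothetical distances among four taxa (illustrative values only).
dist = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
print(prim_mst(dist))
# [(0, 1, 1), (1, 2, 3), (2, 3, 2)]
```

This version rescans all candidate edges at every step, which is adequate for the small matrices typical of a worked example; a priority queue would be the usual choice for larger inputs.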
