An ILP solution for the gene duplication problem

Open Access

15 February 2011

journal article
Published by Springer Nature in BMC Bioinformatics

Vol. 12 (S1), S14
https://doi.org/10.1186/1471-2105-12-s1-s14

Abstract

Background The gene duplication (GD) problem seeks a species tree that implies the fewest gene duplication events across a given collection of gene trees. Solving this problem makes it possible to use large gene families with complex histories of duplication and loss to infer phylogenetic trees. However, the GD problem is NP-hard, and therefore, most analyses use heuristics that lack any performance guarantee. Results We describe the first integer linear programming (ILP) formulation to solve instances of the gene duplication problem exactly. With simulations, we demonstrate that the ILP solution can solve problem instances with up to 14 taxa. Furthermore, we apply the new ILP solution to solve the gene duplication problem for the seed plant phylogeny using a 12-taxon, 6, 084-gene data set. The unique, optimal solution, which places Gnetales sister to the conifers, represents a new, large-scale genomic perspective on one of the most puzzling questions in plant systematics. Conclusions Although the GD problem is NP-hard, our novel ILP solution for it can solve instances with data sets consisting of as many as 14 taxa and 1, 000 genes in a few hours. These are the largest instances that have been solved to optimally to date. Thus, this work can provide large-scale genomic perspectives on phylogenetic questions that previously could only be addressed by heuristic estimates.

Keywords

This publication has 42 references indexed in Scilit:

Constructing majority-rule supertrees
Algorithms for Molecular Biology, 2010
Species Tree Inference by Minimizing Deep Coalescences
PLoS Computational Biology, 2009
The Impact of Outgroup Choice and Missing Data on Major Seed Plant Phylogenetics Using Genome-Wide EST Data
PLOS ONE, 2009
DupTree: a program for large-scale phylogenetic analyses using gene tree parsimony
Bioinformatics, 2008
Using ESTs for phylogenomics: can one accurately infer a phylogenetic tree from a gappy alignment?
BMC Ecology and Evolution, 2008
Inferring angiosperm phylogeny from EST data with widespread gene duplication
BMC Ecology and Evolution, 2007
RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models
Bioinformatics, 2006
Gene Trees in Species Trees
Systematic Biology, 1997
Reconstruction of Ancient Molecular Phylogeny
Molecular Phylogenetics and Evolution, 1996
The rapid generation of mutation data matrices from protein sequences
Bioinformatics, 1992

Cited by 21 articles