Abstract
Motivation: The large volume of single nucleotide polymorphism data now available motivates the development of methods for distinguishing neutral changes from those which have real biological effects. Here, two different machine-learning methods, decision trees and support vector machines (SVMs), are applied for the first time to this problem. In common with most other methods, only non-synonymous changes in protein coding regions of the genome are considered. Results: In detailed cross-validation analysis, both learning methods are shown to compete well with existing methods, and to out-perform them in some key tests. SVMs show better generalization performance, but decision trees have the advantage of generating interpretable rules with robust estimates of prediction confidence. It is shown that the inclusion of protein structure information produces more accurate methods, in agreement with other recent studies, and the effect of using predicted rather than actual structure is evaluated. Availability: Software is available on request from the authors.