Prediction of n-Octanol/Water Partition Coefficients from PHYSPROP Database Using Artificial Neural Networks and E-State Indices

Abstract
A new method, ALOGPS v 2.0 (http://www.lnh.unil.ch/~itetko/logp/), for the assessment of the n-octanol/water partition coefficient, log P, was developed on the basis of neural network ensemble analysis of 12 908 organic compounds available from the PHYSPROP database of Syracuse Research Corporation. The atom-type and bond-type E-state indices as well as the number of hydrogen and non-hydrogen atoms were used to represent the molecular structures. A preliminary selection of indices was performed by multiple linear regression analysis, and 75 input parameters were chosen. Some of the parameters combined several atom-type or bond-type indices with similar physicochemical properties. The neural network ensemble was trained with the efficient partition algorithm developed by the authors. The ensemble contained 50 neural networks, each with 10 neurons in one hidden layer. The prediction ability of the developed approach was estimated using both the leave-one-out (LOO) technique and a training/test protocol. For interseries predictions, i.e., when molecules in the test and training subsets were selected by chance from the same set of compounds, both approaches provided similar results. ALOGPS performance was significantly better than the results obtained by other tested methods. For a subset of 12 777 molecules the LOO results were a squared correlation coefficient r² = 0.95, a root mean squared error RMSE = 0.39, and a mean absolute error MAE = 0.29. For two cross-series predictions, i.e., when molecules in the training and test sets belong to different series of compounds, all analyzed methods performed less efficiently. The decrease in performance could be explained by differences in the diversity of molecules between the training and test sets. However, even for such difficult cases the ALOGPS method provided better prediction ability than the other tested methods.
We have shown that the diversity of the training sets, rather than the design of the methods, is the main factor determining their prediction ability for new data. A comparison of the performance of the methods, as well as its dependence on the number of non-hydrogen atoms in a molecule, is also presented.
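As an illustration of the statistics quoted in the abstract, the following minimal Python sketch averages the outputs of an ensemble of networks and computes r², RMSE, and MAE for the averaged predictions. It is not the ALOGPS implementation: the function names `ensemble_mean` and `error_statistics` are hypothetical, simple averaging over the networks is an assumption, r² is assumed to be the squared Pearson correlation between observed and predicted log P, and RMSE is assumed to be normalized by the number of molecules.

```python
import math


def ensemble_mean(predictions):
    """Average per-network predictions for each molecule.

    predictions: list of lists, one inner list per network, each holding
    one predicted log P value per molecule (hypothetical layout).
    """
    n_networks = len(predictions)
    return [sum(column) / n_networks for column in zip(*predictions)]


def error_statistics(observed, predicted):
    """Return r2 (squared Pearson correlation), RMSE, and MAE."""
    n = len(observed)
    residuals = [p - o for o, p in zip(observed, predicted)]

    # Root mean squared error and mean absolute error (assumed 1/n normalization).
    rmse = math.sqrt(sum(e * e for e in residuals) / n)
    mae = sum(abs(e) for e in residuals) / n

    # Squared Pearson correlation between observed and predicted values.
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    r2 = cov * cov / (var_o * var_p)

    return {"r2": r2, "rmse": rmse, "mae": mae}


if __name__ == "__main__":
    # Toy values for illustration only; they are not taken from PHYSPROP.
    observed = [0.0, 1.0, 2.0]
    network_outputs = [[0.4, 1.4, 2.4], [0.6, 1.6, 2.6]]
    print(error_statistics(observed, ensemble_mean(network_outputs)))
```

In a leave-one-out setting, each entry of `predicted` would be the ensemble output for a molecule that was excluded from training when that prediction was made.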