Exploring classification strategies with the CoEPrA 2006 contest

Open Access

22 January 2010

journal article
research article
Published by Oxford University Press (OUP) in Bioinformatics

Vol. 26 (5), 603-609
https://doi.org/10.1093/bioinformatics/btq021

Abstract

Motivation: In silico methods to classify compounds as potential drugs that bind to a specific target are becoming increasingly important for drug design. To build classification devices, training sets of drugs with known activities are needed. For many such classification problems, not only qualitative but also quantitative information on a specific property (e.g. binding affinity) is available. The latter can be used to build a regression scheme to predict this property for new compounds. Predicting a compound property explicitly is generally more difficult than classifying whether the property lies below or above a given threshold value. Hence, an indirect classification that is based on regression may lead to poorer results than a direct classification scheme. In fact, researchers are initially interested only in classifying compounds as potential drugs. The activities of these compounds are subsequently measured in the wet lab.

Results: We propose a novel approach that uses available quantitative information directly for classification rather than first using a regression scheme. It uses a new type of loss function called weighted biased regression. Application of this method to four widely studied datasets of the CoEPrA contest (Comparative Evaluation of Prediction Algorithms, http://coepra.org) shows that it can outperform simple classification methods that do not make use of this additional quantitative information.

Availability: A standalone application, which can be used to build a model for a peptide training set to be submitted, is available at http://agknapp.chemie.fu-berlin.de/agknapp/index.php?menu=software&page=PeptideClassifier

Contact: odemir@chemie.fu-berlin.de

Supplementary Information: Supplementary data are available at Bioinformatics online.
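The distinction the abstract draws between indirect classification (regress the property, then apply the threshold) and direct classification (learn the above/below label itself) can be illustrated with a minimal sketch. This is not the authors' weighted biased regression, whose loss function is not given in the abstract. The function names, the plain least-squares models, and the toy data are all assumptions made for illustration only.

```python
import numpy as np

def _with_intercept(X):
    """Append a constant column so the linear model has an intercept."""
    X = np.asarray(X, dtype=float)
    return np.hstack([X, np.ones((X.shape[0], 1))])

def fit_least_squares(X, y):
    """Ordinary least-squares fit; returns the weight vector (hypothetical helper)."""
    w, *_ = np.linalg.lstsq(_with_intercept(X), np.asarray(y, dtype=float), rcond=None)
    return w

def predict(X, w):
    """Evaluate the fitted linear model on new descriptors."""
    return _with_intercept(X) @ w

def indirect_classify(X_train, y_train, X_test, threshold):
    """Indirect scheme: regress the quantitative property, then threshold it."""
    w = fit_least_squares(X_train, y_train)
    return predict(X_test, w) >= threshold

def direct_classify(X_train, y_train, X_test, threshold):
    """Direct scheme: fit +1/-1 labels derived from the threshold.

    The quantitative values are used only to create the labels here,
    which is the 'simple classification' baseline, not the paper's method.
    """
    labels = np.where(np.asarray(y_train, dtype=float) >= threshold, 1.0, -1.0)
    w = fit_least_squares(X_train, labels)
    return predict(X_test, w) >= 0.0

# Toy data (assumed): one descriptor, activity equal to the descriptor value.
X_train = [[0.0], [1.0], [2.0], [3.0]]
y_train = [0.0, 1.0, 2.0, 3.0]
X_test = [[0.5], [2.5]]

print(indirect_classify(X_train, y_train, X_test, threshold=1.5))
print(direct_classify(X_train, y_train, X_test, threshold=1.5))
```

On this noise-free toy set both schemes agree. The abstract's argument concerns realistic data, where errors in the regressed value near the threshold can flip the class assignment.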

Cited by 10 articles