Support vector machines with selective kernel scaling for protein classification and identification of key amino acid positions

Abstract
Motivation: Data that characterize the primary and tertiary structures of proteins are accumulating at a rapid and accelerating rate, and automated computational tools are required to extract critical information relating amino acid changes to the spectrum of functional attributes exhibited by a protein. We propose that immunoglobulin-type beta-domains, which are found in approximately 400 functionally distinct forms in humans alone, provide the immense genetic variation within limited conformational changes that might facilitate the development of new computational tools. As an initial step, we describe here an approach based on Support Vector Machine (SVM) technology to identify amino acid variations that contribute to the functional attribute of pathological self-assembly by some human antibody light chains produced during plasma cell diseases.

Results: We demonstrate that SVMs with selective kernel scaling are an effective tool for discriminating between benign and pathologic human immunoglobulin light chains. Initial results compare favorably against manual classification performed by experts and indicate the capability of SVMs to capture the underlying structure of the data. The data set consists of 70 human antibody κ1 light chain proteins, each represented by an aligned sequence of 120 amino acids. We perform feature selection based on a first-order adaptive scaling algorithm, which confirms the importance of changes in certain amino acid positions and identifies other positions that are key in the characterization of protein function.

Contact: nelaz@ra.anl.gov, fstevens@anl.gov, jaques.reifman@amedd.army.mil
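As a rough illustration of the idea behind selective kernel scaling, the following minimal sketch attaches one scale factor to each aligned sequence position inside a Gaussian-style kernel. The abstract does not specify the kernel form, the sequence encoding, or the first-order adaptive scaling algorithm, so everything here is an assumption made for illustration: the function name `scaled_kernel`, the mismatch-based distance, the scale values, and the toy sequence fragments are all hypothetical.

```python
import math


def scaled_kernel(seq_a, seq_b, scales):
    """Gaussian-style kernel on two aligned sequences, one scale factor per position.

    Assumed form (not taken from the paper):
        k(a, b) = exp(-sum_p scales[p]**2 * [a[p] != b[p]])
    """
    if not (len(seq_a) == len(seq_b) == len(scales)):
        raise ValueError("sequences and scales must have the same length")
    # Each mismatched position adds its squared scale factor to the distance.
    dist = sum(s * s for a, b, s in zip(seq_a, seq_b, scales) if a != b)
    return math.exp(-dist)


# Toy aligned fragments (hypothetical, not from the 70-protein data set).
x = "DIQMTQ"
z = "DIVMTQ"  # differs from x only at index 2

uniform = [1.0] * 6                          # every position weighted equally
ignore_pos2 = [1.0, 1.0, 0.0, 1.0, 1.0, 1.0]  # index 2 scaled to zero

k_uniform = scaled_kernel(x, z, uniform)      # mismatch at index 2 lowers similarity
k_ignored = scaled_kernel(x, z, ignore_pos2)  # mismatch at index 2 has no effect
```

In this sketch, a position whose scale factor is driven to zero no longer influences the kernel, which is one way that scaling can act as feature selection: positions that retain large scale factors are candidates for key amino acid positions. How the scale factors would be adapted during training is not shown here.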