Abstract
Cluster analysis is presented as a technique for analyzing the conservation and chemistry of water sites from independent protein structures, and applied to thrombin, trypsin, and bovine pancreatic trypsin inhibitor (BPTI) to locate shared water sites, as well as those contributing to specificity. When several protein structures are superimposed, complete linkage cluster analysis provides an objective technique for resolving the continuum of overlaps between water sites into a set of maximally dense microclusters of overlapping water molecules, and also avoids reliance on any one structure as a reference. Water sites were clustered for ten superimposed thrombin structures, three trypsin structures, and four BPTI structures. For thrombin, 19% of the 708 microclusters, each representing a unique water site, contained water molecules from at least half of the structures, and 4% contained water molecules from all ten. For trypsin, 77% of the 106 microclusters contained water molecules from at least half of the structures, and 57% contained water molecules from all three. Water site conservation correlated with several environmental features: highly conserved microclusters generally had more protein atom neighbors, were in a more hydrophilic environment, made more hydrogen bonds to the protein, and were less mobile. There were significant overlaps between thrombin and trypsin conserved water sites, which did not localize to their similar active sites, but were concentrated in buried regions including the solvent channel surrounding the Na+ site in thrombin, which is associated with ligand selectivity. Cluster analysis also identified water sites conserved in thrombin but not trypsin, and vice versa, providing a list of water sites that may contribute to ligand discrimination. Thus, in addition to facilitating the analysis of water sites from multiple structures, cluster analysis provides a useful tool for distinguishing between conserved features within a protein family and those conferring specificity.
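
The clustering step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy coordinates, and the 1.0 Å distance cutoff are all assumptions made for illustration, and the structures are assumed to be already superimposed. Under complete linkage, two clusters merge only if every pair of waters across them lies within the cutoff, which is what keeps each microcluster compact.

```python
from itertools import combinations
from math import dist


def complete_linkage_microclusters(waters, cutoff):
    """Group water sites into microclusters by complete linkage.

    waters: list of (structure_id, (x, y, z)) from superimposed structures.
    cutoff: maximum allowed distance between any two members (hypothetical value).
    Returns a list of clusters, each a list of indices into `waters`.
    """
    clusters = [[i] for i in range(len(waters))]

    def linkage(a, b):
        # Complete linkage: cluster distance is the LARGEST pairwise distance.
        return max(dist(waters[i][1], waters[j][1]) for i in a for j in b)

    while len(clusters) > 1:
        # Pick the pair of clusters with the smallest complete-linkage distance.
        d, p, q = min(
            (linkage(clusters[p], clusters[q]), p, q)
            for p, q in combinations(range(len(clusters)), 2)
        )
        if d > cutoff:
            break  # no remaining pair can merge without exceeding the cutoff
        clusters[p] = clusters[p] + clusters[q]
        del clusters[q]  # p < q, so index p is unaffected
    return clusters


def conservation(cluster, waters):
    """Number of distinct structures contributing a water to this microcluster."""
    return len({waters[i][0] for i in cluster})


# Toy data: three hypothetical structures "A", "B", "C" (coordinates in angstroms).
waters = [
    ("A", (0.0, 0.0, 0.0)),
    ("B", (0.5, 0.0, 0.0)),
    ("C", (0.0, 0.6, 0.0)),
    ("A", (5.0, 5.0, 5.0)),
    ("B", (5.3, 5.0, 5.0)),
    ("C", (10.0, 0.0, 0.0)),
]
clusters = complete_linkage_microclusters(waters, cutoff=1.0)
counts = [conservation(c, waters) for c in clusters]
for members, n in zip(clusters, counts):
    print(members, "structures represented:", n)
```

Counting distinct structures per microcluster gives the conservation measure reported above (for example, the fraction of microclusters containing waters from all structures). No structure serves as a reference, since every water enters the clustering on equal terms.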