Systems‐level analyses identify extensive coupling among gene expression machines

### Abstract

Here, we develop computational methods to assess and consolidate large, diverse protein interaction data sets, with the objective of identifying proteins involved in the coupling of multicomponent complexes within the yeast gene expression pathway. From among ∼43 000 total interactions and 2100 proteins, our methods identify known structural complexes, such as the spliceosome and SAGA, and functional modules, such as the DEAD‐box helicases, within the interaction network of proteins involved in gene expression. Our process identifies and ranks instances of three distinct, biologically motivated motifs, or patterns of coupling among distinct machineries involved in different subprocesses of gene expression. Our results confirm known coupling among transcription, RNA processing, and export, and predict further coupling with translation and nonsense‐mediated decay. We systematically corroborate our analysis with two independent, comprehensive experimental data sets. The methods presented here may be generalized to other biological processes and organisms to generate principled, systems‐level network models that provide experimentally testable hypotheses for coupling among biological machines.

### Synopsis

Experimental biology has provided significant insight into the individual subprocesses involved in eukaryotic gene expression: transcription, pre‐mRNA capping, splicing and polyadenylation, mRNA export, and translation. However, only recently has the extensive coupling among these subprocesses been recognized ([Maniatis and Reed, 2002][1]; [Orphanides and Reinberg, 2002][2]), and much work remains to elucidate the interactions between molecular machineries involved. To study coupling between gene expression subprocesses in yeast, we chose to take advantage of the large amount of protein interaction data available.
High‐throughput protein interaction assays are a potentially rich source of information to identify mechanisms that facilitate cooperation and communication between individual members of multiprotein complexes. While the sheer size and scope of these data sets render them a valuable resource, care must be taken because of the high error rates and biases inherent in high‐throughput data. We developed a three‐part computational strategy to infer coupling among gene expression machineries from available interaction data that minimizes the impact of data set errors and biases, as outlined in [Figure 1][3]. First, we integrated source data sets to create a comprehensive, weighted protein interaction network ([Figure 1A][3]). As densely connected proteins are likely to correspond to molecular machineries, we clustered the resulting protein interaction network to suggest such groupings ([Figure 1B][3]). Finally, to evaluate and present hypotheses regarding coupling among gene expression subprocesses, we searched for intercluster network motifs (patterns in the arrangement of links and clusters) that are signatures of biological coupling between protein complexes ([Figure 1C][3]).

We collected 13 yeast data sets for our analysis of the gene expression process, but our method can be generalized to any number of interaction data sets, biological processes, and organisms. Among the challenges of integrating a variety of diverse data sets are the varying coverage and quality of each one and the lack of a comprehensive gold standard. To address these difficulties, we developed a relative data set quality (RDQ) score based on measures of pairwise mutual data set overlap. This concept of evaluating quality based solely on mutual comparison has been used by search engines for ranking web pages ([Page et al, 1998][4]).
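
The idea of scoring data sets purely by mutual comparison can be illustrated with a minimal Python sketch. It is not the RDQ formula used in this work, which is not given in this synopsis: it assumes a simple stand-in in which each data set is scored by the mean fraction of its interactions that reappear in each other data set. The function names (`normalize`, `overlap_fraction`, `relative_quality`) and the toy data set names and protein labels are hypothetical.

```python
def normalize(pairs):
    """Treat interactions as undirected and drop self-interactions."""
    return {frozenset(p) for p in pairs if p[0] != p[1]}


def overlap_fraction(a, b):
    """Fraction of the interactions in data set `a` that also occur in `b`."""
    return len(a & b) / len(a) if a else 0.0


def relative_quality(datasets):
    """Assumed stand-in score: mean overlap of a data set with all others.

    `datasets` maps a data set name to an iterable of protein pairs.
    """
    edges = {name: normalize(pairs) for name, pairs in datasets.items()}
    scores = {}
    for name, own in edges.items():
        others = [e for other, e in edges.items() if other != name]
        if not others:
            scores[name] = 0.0
            continue
        scores[name] = sum(overlap_fraction(own, e) for e in others) / len(others)
    return scores


if __name__ == "__main__":
    # Toy data: "noisy" shares only one interaction with the other two sets.
    toy = {
        "curated": [("A", "B"), ("B", "C"), ("C", "D")],
        "screen": [("B", "A"), ("B", "C"), ("A", "D")],
        "noisy": [("A", "B"), ("X", "Y"), ("X", "Z"), ("Y", "Z")],
    }
    for name, score in sorted(relative_quality(toy).items()):
        print(f"{name}: {score:.2f}")
```

In this toy example the data set padded with unconfirmed interactions receives the lowest score, which is the qualitative behavior described above. A real scheme would also need to account for differences in data set size and coverage.
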
We weighted the contribution of each data set to the total network by its RDQ, so that the weight of each network edge reflects the calculated reliability of its source data. We compared different RDQ calculation methods, and selected one that both appropriately penalized data sets corrupted by false positives and corresponded to independent reliability criteria. Because false positives and negatives nevertheless persist in the integrated network, we also derived for each pair of proteins a novel pairwise clustering coefficient (CC). Our CC definition provides a measure of the local, weighted network neighborhood around a pair of proteins, including pairs lacking a direct link. Heuristically, for each pair of proteins, links to common neighbors increase the CC, whereas links to uncommon neighbors indicate promiscuous binding and decrease the CC. The CC thus provides a powerful metric for identifying network regions that correspond to physical protein complexes. We then developed a method, based on the k‐means algorithm, to cluster proteins within the integrated network using their CC‐weighted links. The parameters for both the CC calculation and the clustering algorithm were independently selected based on their ability to discern biologically interpretable clusters.

Within this network of interconnected protein clusters, supervised identification of motifs corresponding to known patterns of process coupling yielded hypotheses about coupling within the gene expression pathway. Our three motif types were as follows: direct coupling, identifying strong links between separate clusters; cluster‐mediated coupling, identifying small clusters that link two larger clusters; and adaptor‐mediated coupling, identifying proteins that may belong to either of two clusters, such as scaffolding linker proteins or proteins that shuttle between complexes and transiently associate with each.
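
The edge weighting and the pairwise CC heuristic can be sketched in Python as follows. This is an illustration under stated assumptions, not the definition used in this work. It assumes that an edge's weight is the sum of the RDQ scores of the data sets reporting it, and it substitutes a weighted Jaccard index over the two proteins' neighborhoods for the CC. The names `build_weighted_network` and `pairwise_cc`, the RDQ values, and the protein labels are hypothetical.

```python
from collections import defaultdict


def build_weighted_network(datasets, rdq):
    """Assumed weighting: sum the RDQ of every data set reporting an edge."""
    network = defaultdict(lambda: defaultdict(float))
    for name, pairs in datasets.items():
        # Count each undirected interaction once per data set.
        unique = {frozenset(p) for p in pairs if p[0] != p[1]}
        for pair in unique:
            u, v = sorted(pair)
            network[u][v] += rdq[name]
            network[v][u] += rdq[name]
    return network


def pairwise_cc(network, u, v):
    """Weighted Jaccard index of the neighborhoods of u and v (a stand-in).

    Shared neighbors raise the score, unshared neighbors lower it, and the
    score is defined even when u and v have no direct link.
    """
    nu = {n: w for n, w in network.get(u, {}).items() if n != v}
    nv = {n: w for n, w in network.get(v, {}).items() if n != u}
    shared = sum(min(nu[n], nv[n]) for n in nu.keys() & nv.keys())
    total = sum(max(nu.get(n, 0.0), nv.get(n, 0.0)) for n in nu.keys() | nv.keys())
    return shared / total if total else 0.0


if __name__ == "__main__":
    toy = {
        "d1": [("A", "B"), ("A", "C"), ("B", "C")],
        "d2": [("A", "C"), ("B", "C"), ("B", "D")],
    }
    rdq = {"d1": 0.5, "d2": 0.25}  # hypothetical quality scores
    net = build_weighted_network(toy, rdq)
    print(pairwise_cc(net, "A", "B"))  # directly linked pair
    print(pairwise_cc(net, "A", "D"))  # pair with no direct link
```

A matrix of such pairwise scores could then be passed to a clustering routine. The k‐means‐based method mentioned above is not reproduced here.
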
Ranking all occurrences of these motifs in the network yielded a prioritized list of experimentally verifiable biological hypotheses. Our analysis was corroborated with independent, experimental protein interaction data from new, more comprehensive genome‐wide complex precipitation data sets....
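
As an illustration of how one motif type might be ranked, the Python sketch below scores direct coupling between clusters. The scoring rule is an assumption, not the one used in this work: it sums the weights of all links that cross between two clusters and divides by the number of possible cross-cluster pairs. The function name `rank_direct_coupling`, the cluster labels, and the link weights are hypothetical.

```python
from collections import Counter, defaultdict


def rank_direct_coupling(link_weights, cluster_of):
    """Rank cluster pairs by mean cross-cluster link weight (assumed score).

    `link_weights` maps (protein, protein) tuples to a weight.
    `cluster_of` maps each protein to its cluster label.
    """
    totals = defaultdict(float)
    for (u, v), weight in link_weights.items():
        cu, cv = cluster_of.get(u), cluster_of.get(v)
        # Skip unclustered proteins and links inside a single cluster.
        if cu is None or cv is None or cu == cv:
            continue
        totals[tuple(sorted((cu, cv)))] += weight

    sizes = Counter(cluster_of.values())
    scored = {
        pair: total / (sizes[pair[0]] * sizes[pair[1]])
        for pair, total in totals.items()
    }
    # Highest-scoring cluster pairs first.
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    clusters = {"A": "c1", "B": "c1", "C": "c2", "D": "c2", "E": "c3"}
    links = {("A", "B"): 1.0, ("A", "C"): 0.8, ("B", "D"): 0.6, ("D", "E"): 0.2}
    for pair, score in rank_direct_coupling(links, clusters):
        print(pair, round(score, 3))
```

Dividing by the number of possible pairs keeps large clusters from dominating the ranking simply because they have more members. Other normalizations are equally plausible.
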