A Distributed Framework for Large-scale Protein-protein Interaction Data Analysis and Prediction Using MapReduce

Funds:  This work was supported in part by the National Natural Science Foundation of China (61772493), the CAAI-Huawei MindSpore Open Fund (CAAIXSJLJJ-2020-004B), the Natural Science Foundation of Chongqing (China) (cstc2019jcyjjqX0013), Chongqing Research Program of Technology Innovation and Application (cstc2019jscx-fxydX0024, cstc2019jscx-fxydX0027, cstc2018jszx-cyzdX0041), Guangdong Province Universities and College Pearl River Scholar Funded Scheme (2019), the Pioneer Hundred Talents Program of Chinese Academy of Sciences, and the Deanship of Scientific Research (DSR) at King Abdulaziz University (G-21-135-38)
  • Protein-protein interactions are of great significance for understanding the functional mechanisms of proteins. With the rapid development of high-throughput genomic technologies, massive amounts of protein-protein interaction (PPI) data have been generated, making efficient analysis very difficult. To address this problem, this paper presents a distributed framework that reimplements one of the state-of-the-art algorithms, CoFex, using MapReduce. To do so, we first conduct an in-depth analysis of CoFex's limitations in efficiency and memory consumption when it is applied to large-scale PPI data analysis and prediction, and then devise a solution for each limitation. In particular, we adopt a novel tree-based data structure to reduce the heavy memory consumption caused by the huge volume of protein sequence information. The procedure of CoFex is then modified to follow the MapReduce framework so that the prediction task is carried out in a distributed manner. Extensive experiments have been conducted to evaluate the performance of our framework in terms of both efficiency and accuracy. The results demonstrate that the proposed framework improves computational efficiency by more than two orders of magnitude while retaining the same high accuracy.


    • In this paper, a distributed framework is presented that reimplements one of the state-of-the-art algorithms with MapReduce so that it can be applied to large-scale protein-protein interaction prediction
    • Experimental results demonstrate that the proposed framework improves computational efficiency by more than two orders of magnitude while retaining the same high accuracy
    • An upper limit on the efficiency of the proposed framework exists: running time cannot be reduced indefinitely by simply increasing the number of computing nodes. Once the number of computing nodes exceeds some threshold, data transfer takes more time than computation and thus constrains further improvement in efficiency
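
The scaling limit in the last highlight can be illustrated with a toy cost model. This is a minimal sketch, not the paper's measured behavior: it assumes computation time shrinks as `compute_work / nodes` while data-transfer time grows linearly with the number of nodes. The function names, the linear transfer term, and the default parameter values are all hypothetical and chosen only for illustration.

```python
def running_time(nodes, compute_work=1000.0, transfer_per_node=0.5):
    """Toy model of total running time on `nodes` computing nodes.

    Assumptions (not from the paper): computation parallelizes perfectly,
    and data-transfer cost grows linearly with the number of nodes.
    """
    compute_time = compute_work / nodes        # shrinks as nodes are added
    transfer_time = transfer_per_node * nodes  # grows as nodes are added
    return compute_time + transfer_time


def best_node_count(max_nodes, **model_params):
    """Return the node count in 1..max_nodes with the lowest modeled time."""
    return min(range(1, max_nodes + 1),
               key=lambda n: running_time(n, **model_params))


if __name__ == "__main__":
    best = best_node_count(200)
    print("best node count under the toy model:", best)
    print("time at best:", running_time(best))
    print("time at 200 nodes:", running_time(200))
```

Under these assumed parameters the modeled running time stops improving near the point where transfer time overtakes computation time, and adding nodes beyond that point makes the job slower. The actual threshold for the framework depends on the cluster and the data, and can only be found experimentally.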


