Performance Comparison of Data Sampling Techniques to Handle Imbalanced Class on Prediction of Compound-Protein Interaction

Akhmad Rezki Purnajaya; Wisnu Ananta Kusuma; Medria Kusuma Dewi Hardhienata

doi:10.24252/bio.v8i1.12002

Akhmad Rezki Purnajaya, Universal University (ORCID: http://orcid.org/0000-0002-2802-7518)
Wisnu Ananta Kusuma, IPB University
Medria Kusuma Dewi Hardhienata, IPB University

Abstract

The prediction of compound-protein interactions (CPI) is an essential step in drug-target analysis, both for developing new drugs and for drug repositioning. One challenge in this field is that non-interacting compound-protein pairs usually outnumber interacting pairs. This class imbalance introduces bias, which may degrade CPI prediction. In addition, few CPI prediction studies have compared data sampling techniques for handling the class imbalance problem. To address this gap, we compare four data sampling techniques: Random Under-sampling (RUS), Combination of Over-Under-sampling (COUS), Synthetic Minority Over-sampling Technique (SMOTE), and Tomek Link (T-Link). Two benchmark CPI datasets, Nuclear Receptor and G-Protein Coupled Receptor (GPCR), are used to test these techniques. The Area Under the Curve (AUC) is used to evaluate the CPI prediction performance of each technique. The AUC values for RUS, COUS, SMOTE, and T-Link are 0.75, 0.77, 0.85, and 0.79, respectively, on the Nuclear Receptor data and 0.70, 0.85, 0.91, and 0.72, respectively, on the GPCR data. SMOTE therefore achieves the highest AUC on both datasets, which indicates that it handles class imbalance in CPI prediction better than the other three techniques.
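For readers unfamiliar with the sampling techniques named in the abstract, the following is a minimal sketch of three of them (RUS, SMOTE, and T-Link) together with a rank-based AUC. It is an illustration under stated assumptions, not the authors' implementation: the function names, the label convention (1 = interacting minority class, 0 = non-interacting majority class), the Euclidean distance, and the choice to balance the classes exactly are all assumptions made here. COUS is omitted because the abstract does not specify how the two steps are combined.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors (assumed metric)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def random_under_sample(X, y, seed=0):
    """RUS: randomly drop majority rows (label 0) until both classes are equal in size.

    Assumes label 1 is the minority class.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    keep = pos + rng.sample(neg, len(pos))
    return [X[i] for i in keep], [y[i] for i in keep]


def smote(X, y, k=3, seed=0):
    """SMOTE-style over-sampling: create synthetic minority rows by interpolating
    between a minority row and one of its k nearest minority neighbours.

    Assumes at least two minority rows; adds rows until the classes are balanced.
    """
    rng = random.Random(seed)
    minority = [x for x, label in zip(X, y) if label == 1]
    n_new = sum(1 for label in y if label == 0) - len(minority)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        others = [m for m in minority if m is not base]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return X + synthetic, y + [1] * len(synthetic)


def remove_tomek_links(X, y):
    """T-Link: a Tomek link is a pair of opposite-class rows that are each
    other's nearest neighbour; this sketch drops the majority member of each link."""
    n = len(X)
    nn = [min((j for j in range(n) if j != i), key=lambda j: sq_dist(X[i], X[j]))
          for i in range(n)]
    drop = {i for i in range(n) if y[i] == 0 and y[nn[i]] == 1 and nn[nn[i]] == i}
    keep = [i for i in range(n) if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


def auc(scores, labels):
    """AUC as the probability that a random positive is scored above a random
    negative, with ties counted as one half."""
    pos = [s for s, label in zip(scores, labels) if label == 1]
    neg = [s for s, label in zip(scores, labels) if label == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Toy data (hypothetical): six non-interacting pairs, two interacting pairs.
X_toy = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.2, 0.2],
         [0.1, 0.2], [0.3, 0.1], [1.0, 1.0], [1.1, 0.9]]
y_toy = [0, 0, 0, 0, 0, 0, 1, 1]

X_rus, y_rus = random_under_sample(X_toy, y_toy)
X_smote, y_smote = smote(X_toy, y_toy)
print("RUS class counts:  ", y_rus.count(0), y_rus.count(1))
print("SMOTE class counts:", y_smote.count(0), y_smote.count(1))
```

In a study like this one, each resampler would be applied to the training split only, a classifier would then be fitted on the resampled data, and `auc` would be computed on untouched test data so that the reported score reflects the original class distribution.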

Author Biographies

Akhmad Rezki Purnajaya, Universal University

Department of Software Engineering, Faculty of Computer

Wisnu Ananta Kusuma, IPB University

Tropical Biopharmaca Research Center, Faculty of Math and Science

Medria Kusuma Dewi Hardhienata, IPB University

Department of Computer Science, Faculty of Math and Science
