Performance Comparison of Data Sampling Techniques to Handle Imbalanced Class on Prediction of Compound-Protein Interaction
Abstract
The prediction of compound-protein interactions (CPI) is an essential step in drug-target analysis, both for developing new drugs and for drug repositioning. One challenge in this field is that non-interacting compound-protein pairs usually outnumber interacting pairs. This class imbalance causes bias, which may degrade CPI prediction. Moreover, few CPI prediction studies have compared data sampling techniques for handling the class imbalance problem. To address this gap, we compare four data sampling techniques: Random Under-sampling (RUS), Combination of Over-Under-sampling (COUS), Synthetic Minority Over-sampling Technique (SMOTE), and Tomek Link (T-Link). Two benchmark CPI datasets, Nuclear Receptor and G-Protein Coupled Receptor (GPCR), are used to test these techniques. The Area Under the Curve (AUC) is used to evaluate the CPI prediction performance of each technique. The AUC values for RUS, COUS, SMOTE, and T-Link are 0.75, 0.77, 0.85, and 0.79, respectively, on the Nuclear Receptor data and 0.70, 0.85, 0.91, and 0.72, respectively, on the GPCR data. SMOTE thus achieves the highest AUC on both datasets, which indicates that it handles the class imbalance problem in CPI prediction better than the other three techniques.
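As a rough illustration of two of the techniques compared above and of the evaluation metric, the following minimal sketch implements random under-sampling, a SMOTE-style interpolation, and a rank-based AUC in plain Python. It is not the authors' implementation: the function names, the two-dimensional toy points, and the parameter values (`n_new`, `k`) are assumptions made for this example, and COUS and T-Link are not shown.

```python
import random

def random_under_sample(majority, minority, rng):
    """RUS: keep every minority row and an equally sized random subset of the majority."""
    kept = rng.sample(majority, k=min(len(minority), len(majority)))
    return kept, list(minority)

def smote(minority, n_new, k, rng):
    """SMOTE-style over-sampling: each synthetic point lies on the segment between
    a minority point and one of its k nearest minority neighbours."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of the chosen point, excluding the point itself.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic

def auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative
    (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical toy data: 50 non-interacting pairs (majority), 5 interacting pairs (minority).
rng = random.Random(0)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(50)]
minority = [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(5)]

maj_rus, min_rus = random_under_sample(majority, minority, rng)  # 5 vs 5 after RUS
synthetic = smote(minority, n_new=45, k=3, rng=rng)              # 5 + 45 = 50 minority rows
```

Under-sampling discards majority examples to reach balance, whereas the SMOTE-style step adds interpolated minority examples and keeps all original data, which is one commonly cited reason the two approaches can behave differently on small datasets.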
Copyright (c) 2020 Akhmad Rezki Purnajaya, Wisnu Ananta Kusuma, Medria Kusuma Dewi Hardhienata
This work is licensed under a Creative Commons Attribution 4.0 International License.
COPYRIGHT AND LICENSE STATEMENT
COPYRIGHT
Biogenesis: Jurnal Ilmiah Biologi is published under the terms of the Creative Commons Attribution license. Authors hold the copyright and retain publishing rights without restriction to their work. Users may read, download, copy, distribute, and print the work in any medium, provided the original work is properly cited.